4.1. Experimental Settings
Experiments were conducted on a Tesla V100 GPU with 32 GB of memory, and the code was implemented based on the MMRotate framework [40]. The Transformer branch uses the Swin Transformer Tiny (Swin-T) model pre-trained on ImageNet. The input image is resized to 1024 × 1024 pixels and then divided into 256 × 256 patches, each of size 4 × 4 pixels. The linear embedding layer has a dimension of 96 and is implemented as a 4 × 4 convolutional layer with 3 input channels, 96 output channels, and a stride of 4. The local attention window is 7 × 7, and the number of Swin Transformer blocks in the four stages is set to 2, 2, 6, and 2, respectively. The CNN branch employs ResNet50, with the number of blocks in the four stages configured as 3, 4, 6, and 3. The Daubechies-2 (db2) wavelet is chosen as the basis function for the wavelet transform.
Table 1 lists the specific model parameters; there, “Merge n × n” refers to the patch merging operation that combines n × n neighboring patch features.
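To make the backbone configuration concrete, the following is a minimal sketch (not the authors' released code) of the 4 × 4 patch embedding and a single-level db2 wavelet decomposition; the tensor shapes and the PyWavelets call are illustrative assumptions.

```python
import torch
import torch.nn as nn
import pywt  # PyWavelets, assumed here for the db2 decomposition

# Linear embedding layer: 4 x 4 convolution, 3 -> 96 channels, stride 4
patch_embed = nn.Conv2d(3, 96, kernel_size=4, stride=4)

image = torch.randn(1, 3, 1024, 1024)   # resized input image
tokens = patch_embed(image)             # shape (1, 96, 256, 256): 256 x 256 patches
print(tokens.shape)

# Single-level 2D discrete wavelet transform with the Daubechies-2 basis
gray = image.mean(dim=1).squeeze(0).numpy()      # toy single-channel view
cA, (cH, cV, cD) = pywt.dwt2(gray, 'db2')        # approximation + 3 detail sub-bands
print(cA.shape, cH.shape, cV.shape, cD.shape)
```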
During training, we adopted a 3× training schedule of 36 epochs. AdamW was employed as the optimizer, with the initial learning rate set to 0.0001, the weight decay to 0.05, and the momentum parameters to (0.9, 0.999). For the experiments on the HRSC2016 and ShipRSImageNet datasets, random flipping was applied as the data augmentation strategy. To ensure a fair comparison, additional random photometric distortion was introduced for the experiments on the HRSC2016-A dataset to enhance the generalization ability of both the comparison models and the proposed model.
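The optimizer settings above map directly onto a standard AdamW call; the snippet below is a minimal sketch under the stated hyper-parameters, with a placeholder module standing in for the detector.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder standing in for the detection network

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,               # initial learning rate
    betas=(0.9, 0.999),    # momentum parameters
    weight_decay=0.05,
)
```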
4.2. Evaluation Metrics
This paper employs Average Precision (AP), a comprehensive measure of the model’s detection capability, as the primary evaluation metric. AP is calculated by averaging precision over multiple Intersection over Union (IoU) thresholds, ranging from 0.50 to 0.95 in steps of 0.05. In addition, AP50 (AP at an IoU threshold of 0.50) and AP75 (AP at an IoU threshold of 0.75) are reported as supplementary metrics.
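As a simple illustration of the metric definition (not the evaluation code used in the experiments), the mAP averages AP over the ten IoU thresholds, with AP50 and AP75 read off at fixed thresholds; the per-threshold AP values below are placeholders.

```python
import numpy as np

iou_thresholds = np.arange(0.50, 1.00, 0.05)                          # 0.50, 0.55, ..., 0.95
ap_per_threshold = np.linspace(0.75, 0.40, num=len(iou_thresholds))   # placeholder AP values

mAP = ap_per_threshold.mean()    # AP averaged over IoU 0.50:0.95
ap50 = ap_per_threshold[0]       # AP at IoU = 0.50
ap75 = ap_per_threshold[5]       # AP at IoU = 0.75
print(round(mAP, 3), round(ap50, 3), round(ap75, 3))
```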
To provide a more comprehensive evaluation of the model’s overall performance, we also assess inference speed and network complexity. Inference speed is measured in frames per second (FPS), while network complexity is quantified by the number of parameters (Param.).
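A generic way to obtain these two quantities (a sketch, not the exact measurement protocol of the paper) is to count parameters and time repeated forward passes, as below with a placeholder network.

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()  # placeholder network
dummy = torch.randn(1, 3, 1024, 1024)

# Network complexity: total number of parameters (Param.)
num_params = sum(p.numel() for p in model.parameters())

# Inference speed: average latency over repeated forward passes -> FPS
with torch.no_grad():
    start = time.time()
    for _ in range(20):
        model(dummy)
    fps = 20 / (time.time() - start)

print(f"Param.: {num_params}, FPS: {fps:.1f}")
```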
4.3. Comparative Experiments
To evaluate the effectiveness and robustness of the proposed method, this section compares WTDBNet with four mainstream oriented object detection approaches, namely Gliding Vertex, S2ANet, RoI Transformer, and Oriented R-CNN, on three datasets: HRSC2016, ShipRSImageNet, and HRSC2016-A.
As shown in Table 2, the experimental results on the HRSC2016 dataset demonstrate that the proposed method achieves the highest mAP of 53.95%, with improvements of 31.38, 32.08, 29.21, and 3.32 percentage points over the compared methods, respectively. Similarly, the results on the ShipRSImageNet dataset, shown in Table 3, indicate that our method achieves the best mAP of 55.87%, outperforming the other methods by 25.45, 25.32, 17.47, and 4.04 percentage points, respectively. On the HRSC2016-A dataset, as shown in Table 4, the proposed method again achieves the highest mAP of 48.16%, with improvements of 28.16, 27.99, 27.11, and 4.52 percentage points over the baselines. Moreover, consistent improvements in AP50 and AP75 are also observed across all three datasets.
The experimental results indicate that traditional oriented object detection algorithms still have considerable room for improvement in detection accuracy, primarily because they do not account for the specific characteristics of fine-grained ship detection and adapt poorly to challenging conditions such as varying illumination and image resolution. In contrast, the proposed method achieves superior performance by jointly modeling global context and local detail features and by effectively integrating the wavelet transform, a classical image processing technique, into the deep feature extraction pipeline. These advantages contribute to its state-of-the-art performance and strong robustness across multiple datasets.
4.4. Ablation Experiment
Ablation studies were conducted on three datasets—HRSC2016, ShipRSImageNet, and HRSC2016-A—to validate each component’s contribution in the proposed method.
Table 5 presents the ablation results on the HRSC2016 dataset. Compared with the baseline model, Oriented R-CNN (ResNet50), introducing the WTFusion Block significantly boosts the mAP from 50.63% to 53.26%, an improvement of 2.63 percentage points. In addition, AP50 and AP75 increase by 1.9 and 2.2 percentage points, respectively, demonstrating that the wavelet transform enhances the extraction of fine-grained image features and thereby improves ship detection performance. Replacing the backbone with the Swin Transformer leads to a further mAP increase of 1.56 percentage points, indicating that, compared with ResNet, the Swin Transformer captures visual representations more effectively through its hierarchical structure and shifted-window attention mechanism. Finally, with the dual-stream backbone network WTDBNet, which integrates the wavelet transform, CNN, and Transformer, the model achieves the best detection performance, with mAP, AP50, and AP75 reaching 53.95%, 74.1%, and 67.53%, respectively, improvements of 3.32, 2.7, and 4.13 percentage points over the baseline. These results validate that the synergistic integration of wavelet transform, CNN, and Transformer significantly enhances feature representation, making the proposed method particularly advantageous for fine-grained ship detection tasks.
We further validated the effectiveness of the proposed method on the ShipRSImageNet dataset; the results are presented in Table 6. Compared with the baseline model, introducing the WTCNN module improves the mAP by 1.15 percentage points, and replacing the backbone with the Swin Transformer adds a further 1.89 percentage points. With WTDBNet, the model achieves mAP, AP50, and AP75 values of 55.87%, 75.11%, and 68.35%, respectively, improvements of 4.04, 3.1, and 6.31 percentage points over the baseline model.
Finally, the ablation study results on the HRSC2016-A dataset are presented in Table 7. Compared with the baseline model, the WTCNN module improves the mAP by 2.33 percentage points, while replacing the backbone with the Swin Transformer yields a gain of 2.21 percentage points. With WTDBNet, the model achieves an mAP of 48.16%, an AP50 of 66.4%, and an AP75 of 57.6%, improvements of 4.52, 3.4, and 4.7 percentage points over the baseline, respectively. These results further demonstrate the robustness and generalization ability of each component of the proposed method under varying environmental conditions and imaging resolutions, offering a promising approach to enhancing ship detection performance in complex scenarios.
To illustrate more intuitively and clearly the contribution of each component of the proposed method to fine-grained ship detection accuracy, we compared the classification confusion matrices of the baseline model (ResNet50), the improved WTCNN model, the model using the Swin Transformer as the backbone, and the final proposed WTDB model on the HRSC2016-A dataset, considering only categories with at least 10 instances, as shown in Figure 11.
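For reference, per-class confusion matrices of this kind can be produced with a procedure along the following lines (a minimal sketch assuming scikit-learn; the class names and predictions below are synthetic placeholders, and only categories with at least 10 ground-truth instances are kept).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic per-instance labels standing in for a detector's classification results
rng = np.random.default_rng(0)
classes = np.array(["Perry", "Container", "Submarine", "Arleigh Burke"])
y_true = rng.choice(classes, size=200)
y_pred = np.where(rng.random(200) < 0.7, y_true, rng.choice(classes, size=200))

# Keep only categories with at least 10 ground-truth instances
labels, counts = np.unique(y_true, return_counts=True)
kept = labels[counts >= 10]
mask = np.isin(y_true, kept)

# Row-normalized confusion matrix: diagonal entries are per-class accuracies
cm = confusion_matrix(y_true[mask], y_pred[mask], labels=kept, normalize="true")
print(np.round(cm, 2))
```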
As observed from the confusion matrices, the WTCNN model outperforms the baseline in overall accuracy, with particularly noticeable per-class gains for Perry (56% → 63%), Container (42% → 51%), and Submarine (79% → 88%). The model with the Swin Transformer backbone improves performance further, demonstrating the superior capability of Transformer structures in capturing global contextual information. Among all models, WTDB achieves the best performance, with significantly improved classification accuracy across most categories. Notably, it exhibits considerable gains for classes such as Arleigh Burke (67% → 76%), Whidbey Island (43% → 57%), Perry (56% → 70%), and San Antonio (45% → 59%). These results validate the effectiveness of integrating wavelet transform, CNN, and Transformer for enhancing fine-grained feature representation in ship detection tasks.
From the above analysis, the visualized confusion matrix results clearly demonstrate the contribution of each component of the model to fine-grained object recognition. The proposed WTDB model not only maintains overall recognition stability but also effectively reduces misclassifications among highly similar categories, thereby exhibiting strong robustness and discriminative capability.