S3DR-Det: A Rotating Target Detection Model for High Aspect Ratio Shipwreck Targets in Side-Scan Sonar Images

Quanhong Ma; Shaohua Jin; Gang Bian; Yang Cui; Guoqing Liu; Yihan Wang

doi:10.3390/rs17020312

Department of Oceanography and Hydrography, Dalian Naval Academy, Dalian 116018, China

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(2), 312; https://doi.org/10.3390/rs17020312

This article belongs to the Special Issue Advancement in Undersea Remote Sensing II

Abstract

The multi-directional rotation and high aspect ratio of targets such as shipwrecks lead to low detection accuracy and poor localization when existing detection models are applied to this target type. Through our research, we identify three main inconsistencies in rotating target detection compared to traditional target detection, i.e., inconsistency between target and anchor frame, inconsistency between classification features and regression features, and inconsistency between rotating frame quality and label assignment strategy. In this paper, to address these three inconsistencies, we propose the Side-scan Sonar Dynamic Rotating Target Detector (S³DR-Det). The model comprises a dynamic rotational convolution (DRC) module designed to gather high-quality features of rotating targets during the feature extraction phase, a feature decoupling module (FDM) designed to separate the features needed for regression and classification in the detection phase, and a dynamic label assignment strategy based on spatial matching prior information (S-A), specific to rotating targets, which classifies positive and negative samples more reasonably and accurately in the training phase. The three modules not only solve the problems unique to each stage but also work closely together to overcome the detection difficulties caused by the multiple orientations and high aspect ratios of targets in side-scan sonar images. Our model achieves an average precision (AP) of 89.68% on the SSUTD dataset and 90.19% on the DNASI dataset. These results indicate that our model has excellent detection performance.

Keywords: side-scan sonar; shipwreck targets; rotating target detection; label assignment strategy

1. Introduction

The detection and identification of undersea targets play an extremely important role in underwater search and rescue, marine engineering construction, marine topography and geomorphology surveys, marine resource surveys, and other fields [1,2,3]. However, because of the complex marine environment, imaging conditions, and measurement methods, undersea target detection is more difficult than target detection in natural images [4,5]. Satisfying the demand for detection accuracy is challenging and has become both a focus and a difficulty of current research [6,7,8]. Acoustic detection locates undersea targets through manual interpretation of the acoustic image formed from the echo information of the target. Because the technology is mature, intuitive, efficient, and easy to use, acoustic detection is widely used in underwater target detection [9,10,11]. As the main equipment for seabed topography and geomorphology surveys, side-scan sonar has become the mainstream equipment for underwater target detection because of its high imaging resolution [12,13,14].

Advances in Deep Convolutional Neural Networks (DCNNs) have led to numerous architectures tailored to the characteristics of sonar images, with impressive results [15,16,17]. Most target detection methods use horizontal bounding boxes (HBBs), whose edges are aligned with the horizontal and vertical axes of the image [18,19,20]. Among them, the two-stage detection model R-CNN [21] is based on horizontal frames: it first generates proposed regions, selecting candidate regions that may contain targets, and then performs classification and bounding-box regression on these candidate frames to determine the classes and precise locations of the targets. Single-stage detection models such as SSD [22] and the YOLO series [23,24] simplify target detection into a single stage: they skip the intermediate candidate region generation step and predict target categories and bounding box locations directly from the input image or feature map. All of the above methods rely on horizontal frames. However, target entities in side-scan sonar images (such as shipwrecks, the most common targets) usually lie in arbitrary directions and have high aspect ratios, as shown in Figure 1. Horizontal frames cannot accurately represent such targets and introduce a large amount of background information, which makes it very challenging for detection algorithms to localize oriented objects accurately.

Figure 1. When using horizontal bounding boxes to detect directional targets, a lot of background information is often introduced, which interferes with the detection and makes the localization less accurate. The red box is the target labeling box.
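To make the background problem concrete, the following minimal sketch computes how much of a horizontal bounding box is background when it encloses a rotated rectangular target. The function name `hbb_background_fraction` and the example sizes (an 80 × 10 target) are illustrative assumptions, not values from the paper.

```python
import math


def hbb_background_fraction(w: float, h: float, theta_deg: float) -> float:
    """Fraction of the enclosing horizontal box NOT covered by a w x h target
    rotated by theta_deg (hypothetical helper, for illustration only)."""
    t = math.radians(theta_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    hbb_w = w * c + h * s  # width of the axis-aligned enclosing box
    hbb_h = w * s + h * c  # height of the axis-aligned enclosing box
    return 1.0 - (w * h) / (hbb_w * hbb_h)


if __name__ == "__main__":
    # Compare an elongated 8:1 target with a 2:1 target at several angles.
    for angle in (0, 15, 45):
        print(angle,
              round(hbb_background_fraction(80, 10, angle), 3),
              round(hbb_background_fraction(20, 10, angle), 3))
```

Under these assumptions, the background share grows with both the rotation angle and the aspect ratio, which is the effect the figure illustrates.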

Sonar images are often affected by the physicochemical properties of the water, suspended organic matter, and floating particles, resulting in blurred, low-contrast images with varying degrees of distortion. These factors make the quality of sonar images generally much poorer than that of optical or radar images acquired in air. Unlike general images, it is difficult to obtain a distortion-free reference image for sonar images. Therefore, when building a database of sonar images, it is not possible to generate a reference image by adding different levels of distortion, which increases the difficulty of image quality assessment. In the underwater environment, target rotation and tilting are more common. On the one hand, underwater targets usually lie in arbitrary orientations; on the other hand, the sensors are usually tilted when acquiring sonar images, which makes detection and identification more difficult than in normal images.

Scholars have conducted extensive research on rotating target detection networks and have made significant progress [25,26,27,28]. For instance, creating multiple anchor frames with different angles, scales, and aspect ratios can help alleviate the significant misalignment between oriented objects and horizontal bounding boxes. However, this approach also introduces a large number of redundant anchor frames, thereby increasing computational complexity and memory consumption. The RoI Transformer [29] converts horizontal RoIs to rotated RoIs (RRoIs) by learning the transformation parameters, effectively mitigating feature misalignment and the problem of a large number of RRoIs. RRPN [30] incorporates the rotation factor into the region proposal network, making it possible to propose regions with arbitrary angles. Yang et al. [31] introduced the Gaussian Wasserstein Distance (GWD) as a regression loss that approximates the non-differentiable rotational IoU (Intersection over Union) loss, making it more suitable for rotated objects with high aspect ratios. KFIoU [32] is an efficient approximation of the SkewIoU loss based on Gaussian modeling and the Kalman filter. In addition, much research has explored the network structure of rotating detectors, including the design of model components and the label assignment strategy [33,34,35].

Rotating target detection is a challenging task, which is more difficult and complex than traditional target detection, mainly in the following three aspects as shown in Figure 2:

Figure 2. Three types of inconsistencies brought by rotating multi-directional targets to the target detection task. (a). Sliding convolution in a fixed direction poses challenges in accurately capturing features of targets oriented in any direction. (b). Red areas are key features for regression and yellow areas are key features for classification. (c). The IoU-based approach to assigning labels makes it difficult to accurately assess the quality of the samples and can result in high-quality negative samples that contain key features (yellow boxes) and low-quality positive samples that do not contain key features (blue boxes).

(1) Inconsistency between target and anchor frame

The convolutional features derived from the backbone network are usually aligned along specific axes and possess a fixed receptive field. However, when dealing with objects distributed in arbitrary directions within sonar images, this can result in a mismatch between the anchor frame and the convolutional features, impeding accurate object characterization. In other words, the anchor frame obtained by the existing method is of low quality and cannot cover the object, resulting in inconsistency between the object and the anchor frame, and it is difficult to represent the whole object with the features of the internal region of the anchor frame. This phenomenon is more significant for objects with high aspect ratios, such as underwater shipwreck targets whose aspect ratios are usually between 1/3 and 1/10, and this misalignment phenomenon exacerbates the imbalance between the target and the background information, hindering performance.

(2) Inconsistency between classification features and regression features

In undersea target detection models, both the classification and regression tasks depend on features obtained from the backbone, and these features are usually rotationally invariant. However, targets in side-scan sonar images are distributed in arbitrary directions. For classification, the class of a target should be judged using fixed features, i.e., rotationally invariant features. For regression, in contrast, accurate position information is difficult to obtain because the target may be rotated in many directions; accurate localization requires features that change with the angle so that the change in the position of the target can be sensed, i.e., rotationally varying features.

(3) Inconsistency between rotating frame quality and label assignment strategy

For directional targets with high aspect ratios, the IoU is very sensitive to changes in angle: small changes in angle lead to sharp changes in IoU. Moreover, a high IoU does not necessarily mean a good classification result. For objects with high aspect ratios, it is difficult for the preset anchor frames to enclose all the features of the target. Some high-IoU frames perform well in regression because they capture the main location information of the target, yet they lack the key features for classification, which leads to poor results; these are low-quality positive samples. Conversely, some low-IoU frames enclose the key features and key location points and are therefore effective, yet such high-quality samples are treated as negative samples. Consequently, current label assignment techniques, which depend solely on IoU scores to differentiate between positive and negative samples, produce an imbalance that adversely affects the performance of the model.
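The angle sensitivity described above can be checked numerically. The sketch below computes the rotated IoU of two boxes with standard-library Python, using Sutherland-Hodgman polygon clipping. It is an illustrative implementation written for this discussion, not the IoU routine used in the paper; the box format `(cx, cy, w, h, theta_deg)` and all function names are assumptions.

```python
import math


def box_corners(cx, cy, w, h, theta_deg):
    """Corners of a rotated box in counter-clockwise order (y-up frame)."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]


def polygon_area(poly):
    """Shoelace formula; returns 0 for degenerate polygons."""
    n = len(poly)
    if n < 3:
        return 0.0
    acc = 0.0
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2.0


def clip_convex(subject, clipper):
    """Sutherland-Hodgman: clip a convex polygon by a convex CCW polygon."""
    out = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, out = out, []
        for j in range(len(inp)):
            p, q = inp[j], inp[(j + 1) % len(inp)]
            # Signed side test: positive means left of edge a->b (inside).
            sp = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
            sq = (b[0] - a[0]) * (q[1] - a[1]) - (b[1] - a[1]) * (q[0] - a[0])
            if sp >= 0:
                out.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # where segment p->q crosses the edge line
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
        if not out:
            return []
    return out


def rotated_iou(box_a, box_b):
    """IoU of two rotated boxes given as (cx, cy, w, h, theta_deg)."""
    pa, pb = box_corners(*box_a), box_corners(*box_b)
    inter = polygon_area(clip_convex(pa, pb))
    return inter / (polygon_area(pa) + polygon_area(pb) - inter)


if __name__ == "__main__":
    # Same angular error, different aspect ratios (sizes are illustrative).
    for delta in (0, 5, 10, 20):
        print(delta,
              round(rotated_iou((0, 0, 10, 10, 0), (0, 0, 10, 10, delta)), 3),
              round(rotated_iou((0, 0, 80, 10, 0), (0, 0, 80, 10, delta)), 3))
```

For the same angular offset, the elongated box loses far more IoU than the square one, which is why an IoU threshold alone is a fragile criterion for high aspect ratio targets.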

To solve the above problems, we propose the S³DR-Det model, which solves the inconsistency problem in rotating target detection at three levels. First, in the feature extraction stage, we designed dynamic rotational convolution, which can extract high-quality rotational features based on the orientation information of the target. Next, due to the issue of mismatched features needed for classification and regression tasks, we designed a feature decoupling module, which inputs rotationally varying features as well as rotationally invariant features into different task branches, making classification and regression more accurate. Finally, we propose the S-A label assignment strategy in the training strategy, which introduces the concept of alignment, integrates the information of IoU, the distance between centroids, and angular difference, and more comprehensively evaluates the quality of the samples for label assignment. The three modules are efficiently coupled together to finally achieve high-quality detection.

2. Related Work

2.1. Oriented Object Detection

Because the current convolutional structure lacks rotational variability in feature extraction, the classification and regression methods using horizontal frame localization cannot accurately describe the pose information of the wreck target. The existing typical deep learning detectors have difficulty detecting targets with rotating poses in a compact and accurate localization. For this problem, the current mainstream work focuses on the improvement of both rotational feature extraction and label assignment strategy optimization.

Side-scan sonar images often contain objects with different orientations, and existing standard backbone networks struggle to characterize these arbitrarily oriented objects. Pu et al. [36] proposed adaptive rotational convolution (ARC), in which the rotation angle is predicted by a data-driven routing function, enabling adaptive rotation of the convolution kernel. An efficient conditional computation mechanism is also introduced to streamline operations and significantly enhance the capacity of the backbone network to represent oriented objects. Han et al. [37] proposed a rotation-equivariant detector (ReDet), which incorporates a rotation-equivariant network to extract rotation-equivariant features. On this basis, a rotation-invariant RoI Align (RiRoI Align) was introduced, which adaptively extracts rotation-invariant features from the equivariant features in both the spatial and orientation dimensions according to the orientation of the RoI. Ding et al. [29] designed the RoI Transformer, which learns the transformation parameters needed to convert horizontal RoIs into rotated RoIs (RRoIs) under the supervision of rotated bounding box annotations, and introduced a rotated position-sensitive RoI alignment module to ensure spatially consistent feature extraction. This model effectively mitigates feature misalignment and the problem of a large number of RRoIs in directional target detection. Yang et al. [38] proposed R³Det, which achieves accurate target detection through progressive coarse-to-fine regression and a feature refinement module that uses bilinear interpolation to obtain accurate feature vectors. Its approximate SkewIoU loss, which considers both the direction and the magnitude of the gradient, provides a more precise rotation estimate. Pan et al. [39] proposed DRN, which contains an FSM and a DRH. The FSM allows neurons to flexibly adjust their receptive fields to accommodate targets of various orientations and shapes, while the DRH optimizes the prediction using kernel weights obtained from dynamic filters. S²ANet [40] contains a feature alignment module (FAM) and an orientation detection module (ODM). The FAM improves the alignment between convolutional features and anchors. The ODM encodes orientation information with active rotating filters, extracts invariant features, and feeds them into the sub-networks for prediction, which improves the accuracy of classification and regression. Yang et al. [41] converted the angle prediction of a rotation detector into a classification task to constrain the range of prediction. They designed a circular smooth label (CSL) mechanism that exploits angular periodicity to enhance classification fault tolerance and reduce the accuracy loss caused by angular prediction errors.

2.2. Sample Selection for Object Detection

Traditional undersea target detection models typically classify samples as positive or negative using the IoU metric. However, this method cannot accurately assess the quality of samples with high aspect ratios. Even a slight shift of the anchor frame can drastically reduce the IoU, causing significant fluctuations in the value of the loss function. This instability prevents the loss from accurately reflecting the localization ability of the model and affects training stability. Zhu [42] proposed a new representation of the oriented bounding box that simplifies label generation by embedding period-different vectors instead of regressing the box directly. An alternative approach for calculating IoU, termed length-independent IoU (LIIoU), truncates the longer side of the target box to maximize the IoU between the candidate and target boxes. This method is particularly suitable for targets with substantial aspect ratios. Zhang et al. [43] pointed out that the main factor affecting the performance of anchor-based and anchor-free detection methods is how training samples are selected; on this basis, they proposed a dynamic training sample selection method that defines samples by dynamic thresholding and automatically selects positive and negative samples according to the characteristics of the data. Hou et al. [44] proposed the shape-adaptive positive and negative sample allocation strategy SASM. Its SA-S component selects samples dynamically by combining the target aspect ratio with the mean and variance of the IoU distribution. On this basis, SA-M distinguishes the quality of positive samples according to their distance from the center of the anchor frame, which effectively addresses sample selection in rotating target detection that ignores shape and quality differences. Ming et al. [45] pointed out that evaluating anchor quality by IoU is unreasonable and proposed DAL, which uses a matching degree (MD) based on the input and output IoU of the network and updates it dynamically during training to achieve effective label assignment and mitigate the inconsistency between classification and regression. Huang et al. [46] proposed the Generalized Gaussian Heat Map Label Assignment Strategy (GGHL), which adopts the target-adaptive sampling strategy OLA to sample rotated elliptical Gaussian regions, mitigating the imbalance between positive and negative samples and reflecting the size and orientation of the target more closely. Li et al. [47] proposed an adaptive point learning method to capture the geometric information of arbitrarily oriented objects, designing an orientation transformation function together with a quality assessment and sample allocation strategy. By selecting representative oriented point samples, it extracts non-axis-aligned salient features from neighboring objects or backgrounds. A²S-Det [48] proposed a sample-balanced adaptive anchor assignment: candidate anchors are selected by horizontal IoU, positive and negative anchors are divided by a statistical threshold on the rotational IoU, an adaptive thresholding module balances the two, and a relative reference coordinate regression (CR3) module finally regresses the rotated bounding box accurately. Yang et al. [31] proposed a regression loss based on the Gaussian Wasserstein Distance (GWD) to solve the boundary discontinuity of the rotated bounding box and its inconsistency with the detection metric. The rotated frame is converted to a 2D Gaussian distribution, and the rotated IoU loss is approximated by the GWD to achieve effective learning.
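As an illustration of the box-to-Gaussian idea, the sketch below converts a rotated box into a 2D Gaussian and evaluates the squared Wasserstein-2 distance between two such Gaussians in closed form. The covariance convention Σ = R · diag(w²/4, h²/4) · Rᵀ is the one commonly used in the GWD literature and is assumed here; the function names are hypothetical, and the nonlinear mapping that turns this distance into a training loss is omitted.

```python
import math


def box_to_gaussian(cx, cy, w, h, theta_deg):
    """Mean and covariance (sxx, sxy, syy) of the Gaussian for a rotated box.
    Assumed convention: Sigma = R diag(w^2/4, h^2/4) R^T."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    a, b = (w / 2) ** 2, (h / 2) ** 2
    sxx = a * c * c + b * s * s
    sxy = (a - b) * c * s
    syy = a * s * s + b * c * c
    return (cx, cy), (sxx, sxy, syy)


def gwd_squared(box1, box2):
    """Squared Wasserstein-2 distance between the two box Gaussians.
    For 2x2 covariances, Tr((S1^(1/2) S2 S1^(1/2))^(1/2)) equals
    sqrt(Tr(S1 S2) + 2 * sqrt(det(S1) * det(S2)))."""
    (x1, y1), (a1, b1, c1) = box_to_gaussian(*box1)
    (x2, y2), (a2, b2, c2) = box_to_gaussian(*box2)
    trace_sum = (a1 + c1) + (a2 + c2)
    det1 = a1 * c1 - b1 * b1
    det2 = a2 * c2 - b2 * b2
    trace_prod = a1 * a2 + 2.0 * b1 * b2 + c1 * c2  # Tr(S1 S2)
    cross = math.sqrt(max(trace_prod + 2.0 * math.sqrt(max(det1 * det2, 0.0)), 0.0))
    return (x1 - x2) ** 2 + (y1 - y2) ** 2 + trace_sum - 2.0 * cross


if __name__ == "__main__":
    ref = (0, 0, 80, 10, 0)
    for delta in (0, 5, 10, 20, 90, 180):
        print(delta, round(gwd_squared(ref, (0, 0, 80, 10, delta)), 3))
```

Unlike the rotated IoU, this distance varies smoothly with the angle and is unchanged when a box is rotated by 180°, which is the boundary-continuity property the text refers to.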

3. Method

3.1. Network Architecture

In most existing target detectors, the convolutional structure used in the backbone extracts target features along fixed axes or at preset fixed rotation angles. However, objects in side-scan sonar images tend to lie at arbitrary angles, so it is difficult for existing convolution kernels to extract high-quality features from these arbitrarily oriented objects.

We propose the S³DR-Det rotating target detection model. First, a dynamic rotational convolution (DRC) module is designed to extract the rotated features of objects with arbitrary orientations. Second, the Head usually consists of two parts: the classification head, which determines the target category, and the regression head, which determines the position and size of the detection frame. In the detection of rotating multi-orientation targets, however, the features required for classification and regression are inconsistent. To solve this problem, we design an adaptive feature decoupling module (FDM) in the prediction part. Third, the training process of existing target detection models relies only on IoU to judge the quality of samples. This label assignment strategy has great limitations: low-quality samples are misclassified as positive samples, while high-quality negative samples may not be effectively utilized and mined. To address this, we design the S-A label assignment strategy, which uses IoU, centroid distance, and angle difference together as the basis for distinguishing between positive and negative samples, assessing the quality of anchor frames more comprehensively and effectively improving the performance of the model. The overall detector network structure is shown in Figure 3:

Figure 3. General structure of the network.

The Backbone extracts image features through a series of convolutional layers and activation functions, gradually decreasing the spatial dimension of the image while increasing the number of channels. We replace part of the 3 × 3 convolutions in the Backbone with DRC modules, in which the convolution kernel is dynamically rotated according to the input feature map to extract the rotated features of targets in any direction. The Neck, located between the Backbone and the Head, uses a Feature Pyramid Network (FPN). The FPN constructs bottom-up and top-down feature fusion paths to fuse feature maps at different scales, generating a feature pyramid with rich multi-scale information. The Head generates the final detection results, i.e., the category, bounding box, and confidence level of the target, from the feature maps provided by the Neck. We introduce the FDM in the Head. The fused features are first processed by an anchor frame optimization module to refine anchor frames and align features. Subsequently, the refinement detection module employs an active rotation filter to encode orientation information, producing rotationally varying and rotationally invariant features. These features are then fed into the separate regression and classification sub-networks to yield the final predictions.

In the training process, we propose a label assignment strategy based on spatial matching prior information. It takes into account the IoU before and after regression, the distance between centroids, and the angular difference, so that labels are assigned more accurately. The strategy selects high-quality positive samples that better match the attributes of rotating frames and adapts to objects of diverse aspect ratios, sizes, and orientations, thereby boosting the generalization capability of the model. Because the division into positive and negative samples is adjusted dynamically during training, the strategy accelerates model convergence, mitigates the discrepancy between traditional label assignment strategies and rotating frame quality, and enhances model performance.

3.2. Dynamic Rotational Convolution Module

The DRC module encodes the input features by deep convolution and combined average and max pooling and then predicts the rotation angles and weights through two different activation functions. By combining multiple rotated convolution kernels and then performing the convolution operation, the module improves the ability of the network to represent targets with different orientations. The specific structure of the DRC module is shown in Figure 4.

Figure 4. DRC module schematic.

First, the image features are passed through a deep convolution, followed by layer normalization and ReLU activation. The activated features are then pooled by both mean pooling and maximum pooling, and the results are merged to obtain enriched features. The merged feature vector is passed through linear layers and the tanh and sigmoid activation functions, respectively, to obtain the predicted rotation angles $\alpha = [\alpha_1, \dots, \alpha_n]$ and weights $\omega = [\omega_1, \dots, \omega_n]$. Specifically, we use the tanh activation function to predict the rotation angle because its output lies between −1 and 1, which helps to control the range of the rotation angle. The sigmoid activation function, on the other hand, is used to predict the weights because its output lies between 0 and 1, which helps to control the normalization of the weights.
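The two prediction branches can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the pooled feature is a flat list, each branch is a single linear map, the angle branch has no bias term, and the scaling constant `max_angle_deg` that turns the tanh output into degrees is a hypothetical parameter that the text does not specify.

```python
import math


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_angles_and_weights(pooled, w_angle, w_weight, b_weight,
                               max_angle_deg=40.0):
    """Predict n rotation angles and n combination weights.

    pooled   : pooled feature vector (list of floats)
    w_angle  : n rows of angle-branch weights (no bias on this branch)
    w_weight : n rows of weight-branch weights
    b_weight : n biases of the weight branch
    max_angle_deg is an assumed scale for the tanh output.
    """
    angles, weights = [], []
    for row_a, row_w, b in zip(w_angle, w_weight, b_weight):
        z_angle = sum(a * x for a, x in zip(row_a, pooled))        # bias-free
        z_weight = sum(w * x for w, x in zip(row_w, pooled)) + b
        angles.append(max_angle_deg * math.tanh(z_angle))          # bounded angle
        weights.append(_sigmoid(z_weight))                         # value in (0, 1)
    return angles, weights


if __name__ == "__main__":
    pooled = [0.2, -0.1, 0.4]
    w_a = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
    w_w = [[0.3, 0.3, 0.3], [-1.0, 2.0, 0.0]]
    print(predict_angles_and_weights(pooled, w_a, w_w, [0.0, 0.1]))
```

Because the angle branch has no bias, an all-zero pooled feature yields zero rotation for every kernel, so any predicted rotation is driven by the input content alone.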

We disable the bias of the linear layer that predicts the angle to avoid learning an overly deviated angle. The DRC module contains n kernels (K_1, …, K_n) of size k × k. For a given input feature A:

$$G(A) = \alpha \cdot \omega \tag{1}$$

Each kernel is rotated individually according to its predicted rotation angle:

$$R_i = R(K_i; \alpha_i), \quad i = 1, \dots, n \tag{2}$$

where α_i denotes the rotation angle of K_i, R_i is the rotated kernel, and R(·) is the convolution kernel rotation process. Each rotated kernel R_i is convolved with the input feature A, and the output features are weighted and summed in a pixel-by-pixel manner:

$$S = \omega_1 \cdot (R_1 \otimes A) + \dots + \omega_n \cdot (R_n \otimes A) \tag{3}$$

where $\omega = [\omega_1, \dots, \omega_n]$ are the predicted weights, $\otimes$ is the convolution operation, and S is the combined output feature map.

For the rotated convolution kernel, the core is to consider the convolution kernel as sample points in the ‘convolution space’, extend the parameters of the convolution kernel to two dimensions by interpolation, and then rotate the sample points according to the rotation angle α. Taking a 3 × 3 convolution kernel as an example, the convolution kernel is shown in Figure 5 before any rotation. Introducing the concept of ‘convolution space’ extends this 3 × 3 grid into a larger two-dimensional space so that the space around each weight point can be used for interpolation to generate new weight values. In this larger space, we are free to move and rotate the original weight points, not just confined to the grid lines of the 3 × 3 grid. In this convolutional space, the original weight points are retained, but new positions around them can be interpolated to get new weight values from the original weights. In this way, a continuous weight space is formed instead of a discrete grid. Now, to rotate this convolution kernel, we can sample the weights in the continuous convolution space to form a new rotated convolution kernel.

Figure 5. Schematic diagram of the principle of rotated convolution kernel, the example of which is a 3 × 3 convolution, where the 3 × 3 weight values are first expanded into a continuous two-dimensional space, followed by rotating the original convolution kernel by a certain angle to obtain a new rotated convolution kernel and then re-sampling the weight values from the rotated convolution kernel position to obtain the final rotated convolution kernel.
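The rotation-by-resampling idea can be sketched as follows. This is a minimal illustration with assumed details: the kernel is a list of lists, rotation is about the kernel centre, bilinear interpolation is used, and sample points that fall outside the original grid contribute zero. `rotate_kernel` and `combine_kernels` are hypothetical names. The second function relies on the linearity of convolution: the weighted sum of the n convolutions in Equation (3) equals one convolution with the weighted sum of the rotated kernels.

```python
import math


def rotate_kernel(kernel, theta_deg):
    """Resample a k x k kernel on a grid rotated by theta_deg about its centre."""
    k = len(kernel)
    c = (k - 1) / 2.0
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)

    def sample(y, x):
        # Bilinear interpolation in the continuous weight space;
        # positions outside the original grid are treated as zero.
        y0, x0 = math.floor(y), math.floor(x)
        fy, fx = y - y0, x - x0
        total = 0.0
        for dy, wy in ((0, 1.0 - fy), (1, fy)):
            for dx, wx in ((0, 1.0 - fx), (1, fx)):
                yy, xx = y0 + dy, x0 + dx
                if 0 <= yy < k and 0 <= xx < k:
                    total += wy * wx * kernel[yy][xx]
        return total

    rotated = []
    for r in range(k):
        row = []
        for col in range(k):
            x, y = col - c, r - c
            # Map each output position back to its source position.
            xs = cos_t * x + sin_t * y
            ys = -sin_t * x + cos_t * y
            row.append(sample(ys + c, xs + c))
        rotated.append(row)
    return rotated


def combine_kernels(kernels, angles_deg, weights):
    """Weighted sum of rotated kernels (single-kernel form of Equation (3))."""
    k = len(kernels[0])
    combined = [[0.0] * k for _ in range(k)]
    for kern, angle, w in zip(kernels, angles_deg, weights):
        rot = rotate_kernel(kern, angle)
        for r in range(k):
            for col in range(k):
                combined[r][col] += w * rot[r][col]
    return combined


if __name__ == "__main__":
    K = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for row in rotate_kernel(K, 30):
        print([round(v, 3) for v in row])
```

Combining the kernels before convolving means the input feature map is convolved only once, regardless of the number of kernels n.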

3.3. Feature Decoupling Module

To address the disparity between the features required for the classification and regression tasks within the detection head of the rotated detection network, we designed an adaptive feature decoupling head. Its mechanism is shown in Figure 6. It mainly consists of two parts: the anchor frame optimization module (AFO) and the dynamic refinement detection module (DRM).

Figure 6. Schematic diagram of the adaptive feature decoupling detection head structure.

The AFO module generates high-quality anchor frames, and features are adaptively aligned to the corresponding anchor frames with dynamic rotation convolution. Unlike methods that preset many anchor frames, only one anchor frame is used at each position of the feature map, and the AFO optimizes it into a high-quality rotated anchor frame. The AFO contains an anchor frame regression and a rotational convolution feature alignment operation. The anchor frame regression refines the horizontal anchor frames into high-quality rotated anchor frames that are close to the target shape. The features are then dynamically and adaptively aligned according to the shape, size, and orientation of the corresponding anchor frames.

The orientation information is then encoded by the dynamic rotational encoder (DRE) in the DRM, which rotates its filter dynamically during the convolution process to generate a feature map with multiple orientation channels. In simple terms, the results for all orientations are subsequently maximized pixel by pixel: if the object has a certain direction, the response of the corresponding orientation channel should be larger, so the features of that direction can be extracted. The DRE is a filter of dimensions k × k × N that rotates N − 1 times dynamically during convolution, producing feature maps with N orientation channels (N = 8 by default). For a feature map A and a DRE (denoted E), the output S^{(i)} of the ith direction can be represented as:

$$S^{(i)} = \sum_{n=0}^{N-1} E_{\alpha_i}^{(n)} \cdot A^{(n)}, \quad \alpha_i = \frac{2\pi i}{N}, \quad i = 0, \dots, N-1 \tag{4}$$

where α_i represents the rotation angle of the filter and n indexes the nth orientation channel. Using the DRE in the convolutional layer, we obtain rotationally varying features in which the orientation information is explicitly encoded. The bounding box regression task can use these rotationally varying features directly; the object classification task, however, requires rotationally invariant features. These are extracted by pooling the rotationally varying features, i.e., by selecting the most responsive orientation channel as the output feature $\hat{S}$:

$$\hat{S} = \max_{n} S^{(n)}, \quad 0 \le n \le N-1 \tag{5}$$

By doing so, we can align object features with different orientations. Compared with the rotationally varying features, the rotationally invariant features are compact, so good results can be achieved with few parameters.
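The pooling step can be illustrated with a toy example. The sketch below assumes that the orientation responses are stored as a list of N channels, each holding one value per pixel, and it models a rotation of the object by a multiple of 2π/N as a cyclic shift of the orientation channels. That is an idealization: it ignores the accompanying spatial rotation of the feature map. Both function names are hypothetical.

```python
def orientation_pool(responses):
    """Per-pixel maximum over the N orientation channels.

    responses[n][p] is the response of orientation channel n at pixel p.
    """
    num_pixels = len(responses[0])
    return [max(channel[p] for channel in responses) for p in range(num_pixels)]


def cyclic_shift(responses, steps):
    """Idealized rotation by steps * 2*pi/N: shift the orientation channels."""
    n = len(responses)
    return [responses[(i - steps) % n] for i in range(n)]


if __name__ == "__main__":
    # N = 4 orientation channels, 2 pixels (toy values).
    responses = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4], [0.5, 0.8]]
    print(orientation_pool(responses))
    print(orientation_pool(cyclic_shift(responses, 1)))
```

The channel order changes under the shift, so the unpooled features still carry orientation information for regression, while the pooled feature is identical in both cases and is therefore suited to classification.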

3.4. S-A Dynamic Label Assignment Strategy

There is irrationality in relying only on IoU to measure the anchor frame quality in rotating target detection, which cannot comprehensively consider the rotation angle, shape difference, and distance information, and easily leads to inaccurate label assignment, which, on the one hand, leads to sample imbalance due to misclassification of positive and negative samples, and, on the other hand, fails to make full use of high-quality negative samples with high localization potential, which leads to instability in training as well as regression uncertainty, that is to say, the problem of inconsistency between the rotating inconsistency between frame quality and label assignment strategy. Based on the above analysis, we propose a dynamic label assignment strategy based on spatial matching prior information (SMPI) and introduce alignment degree as an index to measure the quality of anchor frames, defined as follows:

AD = IoU_{post} - \left( α \cdot (1 - IoU_{pre}) + β \cdot \frac{d}{\max_{d}} + γ \cdot \frac{θ}{\max_{θ}} \right)

(6)

where IoU_pre is the rotated IoU value before regression, IoU_post is the rotated IoU value after regression, d is the distance between the centroid of the rotated prediction frame and the centroid of the rotated real frame, θ is the angular difference between the rotated prediction frame and the rotated real frame, max_d and max_θ are the maximum possible distance and the maximum angular difference, respectively, and α, β, and γ are weighting hyperparameters that control the influence of the individual terms.

Effectively suppressing interference during regression makes it possible to collect higher-quality anchor frames and makes training more stable. Our penalty term for regression uncertainty consists of three parts, namely the IoU part, the inter-center distance, and the angular difference, so that high-quality, highly aligned rotated anchor frames are selected from three different aspects. Based on AD, we conduct adaptive anchor frame selection for assigning labels. In the training phase, the AD between the GT frames and the predicted frames is first calculated; anchor frames with AD values greater than or equal to a certain threshold are selected as positive samples, and those below the threshold are negative samples. For GT frames without a matching anchor frame, we designate the anchor frame with the highest AD score as a positive candidate, enabling dynamic label assignment.
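The alignment degree of Equation (6) and the threshold-based assignment described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the rotated IoU values are taken as precomputed inputs, the function names and the threshold value are hypothetical, the default weights are the best combination reported in Table 3, and restricting the fallback to anchors that are not yet positive is an added assumption.

```python
def alignment_degree(iou_pre, iou_post, dist, max_dist, angle, max_angle,
                     alpha=0.5, beta=0.2, gamma=0.3):
    """Alignment degree AD of Eq. (6); rotated IoU values are assumed precomputed."""
    penalty = (alpha * (1.0 - iou_pre)
               + beta * dist / max_dist
               + gamma * angle / max_angle)
    return iou_post - penalty


def assign_labels(ad_matrix, threshold):
    """Assign each anchor the index of a GT frame, or -1 for a negative sample.

    ad_matrix[g][a] is the AD between GT frame g and anchor a.
    """
    num_anchors = len(ad_matrix[0])
    labels = [-1] * num_anchors
    best_ad = [float("-inf")] * num_anchors
    # Step 1: anchors whose AD reaches the threshold become positives
    # (an anchor matching several GT frames keeps the one with the highest AD).
    for g, row in enumerate(ad_matrix):
        for a, ad in enumerate(row):
            if ad >= threshold and ad > best_ad[a]:
                labels[a], best_ad[a] = g, ad
    # Step 2: a GT frame without any positive anchor receives its highest-AD
    # anchor among those still negative (assumption made for this sketch).
    for g, row in enumerate(ad_matrix):
        if g in labels:
            continue
        free = [a for a in range(num_anchors) if labels[a] == -1]
        if free:
            labels[max(free, key=lambda a: row[a])] = g
    return labels


# Made-up example: penalty = 0.5*0.5 + 0.2*0.1 + 0.3*0.1 = 0.30, so AD is about 0.60.
ad_example = alignment_degree(iou_pre=0.5, iou_post=0.9, dist=10.0,
                              max_dist=100.0, angle=9.0, max_angle=90.0)
labels = assign_labels([[0.7, 0.2, 0.1], [0.1, 0.3, 0.2]], threshold=0.5)
```

In the toy call, anchor 0 passes the (hypothetical) threshold for GT frame 0, while GT frame 1 has no anchor above the threshold and therefore receives its best remaining anchor.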

4. Experiment and Results

4.1. Dataset

To validate the dynamic rotating target detection network proposed in this paper, we selected SSUTD and DNASI, two datasets containing shipwreck targets with high aspect ratios and arbitrary orientations, for experimental validation. The SSUTD (Side-scan Sonar Undersea Target Dataset) is a dataset for shipwreck target detection in side-scan sonar images. It contains 1691 images of shipwreck samples, collected by domestic and foreign research institutes and manufacturers using well-known domestic and foreign side-scan sonar equipment in different sea areas, with some obtained from online platforms. The DNASI (Dalian Naval Academy Side-scan Sonar Images) is a side-scan sonar image dataset collected by the Dalian Naval Academy and augmented with a certain amount of data, containing 2123 images of wreck samples [49]. Some of the target sample images are shown in Figure 7, which shows that the targets usually lie in arbitrary directions and have a high aspect ratio, making the datasets suitable for verifying the effectiveness of the dynamic rotating target detection network. We allocate the entire dataset into training, validation, and test sets in a 5:2:3 ratio.
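A 5:2:3 split of this kind can be sketched as below. The shuffling, the fixed seed, the rounding rule, and the placeholder image identifiers are assumptions for illustration; the text does not state how the split was drawn.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle items and split them into train/val/test with a 5:2:3 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed only for reproducibility
    n_train = len(items) * 5 // 10
    n_val = len(items) * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


# Placeholder identifiers for the 1691 SSUTD images (hypothetical naming).
image_ids = [f"img_{i:04d}" for i in range(1691)]
train_ids, val_ids, test_ids = split_dataset(image_ids)
```

With integer rounding, 1691 images yield 845 training, 338 validation, and 508 test images under this sketch.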

Figure 7. Example dataset.

4.2. Implementation Details

For the experiments, we built a baseline model based on ResNet. Because the training samples come from a wide range of sources, we convert all samples to greyscale before inputting them into the network, for the generality of the model. Only random horizontal flipping is used during training to avoid overfitting, and a small learning rate is used to avoid drastic changes in the rotation angle. The optimizer is SGD, with the initial learning rate set to 0.0025, momentum set to 0.9, and weight decay set to 0.0001 to avoid overfitting or underfitting. Training begins with 500 warm-up iterations; no pre-trained weights are used, and the model is trained from scratch. The model was developed in Python using the PyTorch deep learning library. The experiments were conducted on a Windows 11 operating system with an NVIDIA RTX 4060 GPU and 64 GB of RAM.
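The optimizer settings and the warm-up phase can be summarized in a short sketch. The learning rate, momentum, weight decay, and the 500 warm-up iterations are taken from the text; the linear warm-up shape and the starting ratio `WARMUP_RATIO` are assumptions, since the text does not specify them.

```python
# Values stated in the text.
SGD_SETTINGS = {"lr": 0.0025, "momentum": 0.9, "weight_decay": 0.0001}
WARMUP_ITERS = 500

# Hypothetical: fraction of the base learning rate used at iteration 0.
WARMUP_RATIO = 1.0 / 3.0


def warmup_lr(iteration, base_lr=SGD_SETTINGS["lr"],
              warmup_iters=WARMUP_ITERS, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given iteration under an assumed linear warm-up."""
    if iteration >= warmup_iters:
        return base_lr
    progress = iteration / warmup_iters
    return base_lr * (warmup_ratio + (1.0 - warmup_ratio) * progress)


# Learning rate at the start, midway through, and at the end of warm-up.
schedule_preview = [warmup_lr(i) for i in (0, 250, 500)]
```

Under these assumptions the learning rate rises linearly from one third of 0.0025 to 0.0025 over the first 500 iterations and then stays at the base value until any later decay, which the text does not describe.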

Evaluation Metrics

Precision (P) and recall (R) are common metrics for evaluating the performance of a detection model. A true positive (TP) occurs when a positive sample is correctly predicted as positive, while a true negative (TN) occurs when a negative sample is correctly predicted as negative. Conversely, a false positive (FP) arises when a negative sample is incorrectly predicted as positive, and a false negative (FN) occurs when a positive sample is incorrectly predicted as negative. Precision and recall are computed from these counts as follows:

P = \frac{T P}{T P + F P}

(7)

R = \frac{T P}{T P + F N}

(8)
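Equations (7) and (8) translate directly into code. The sketch below uses made-up counts, and returning 0.0 when the denominator is zero is a convention chosen here, not something stated in the text.

```python
def precision(tp, fp):
    """Precision P = TP / (TP + FP), Eq. (7); returns 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Recall R = TP / (TP + FN), Eq. (8); returns 0.0 if there are no positive samples."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Made-up counts: 8 correct detections, 2 false alarms, 8 missed targets.
p_example = precision(tp=8, fp=2)  # 0.8
r_example = recall(tp=8, fn=8)     # 0.5
```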

Geometrically, the average precision (AP) is the area under the precision-recall (P-R) curve, as given in Equation (9). This area can be approximated through interpolation and summation to evaluate the target detection accuracy:

AP = \int_{0}^{1} p(r) \, dr

(9)
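One common way to approximate the integral in Equation (9) is all-point interpolation, in which the precision is first made monotonically non-increasing and the area is then summed over recall steps. The text does not state which interpolation the authors used, so the sketch below is an assumption; the function name and the sample points are illustrative.

```python
def average_precision(recalls, precisions):
    """Approximate AP as the area under the P-R curve (all-point interpolation).

    recalls must be sorted in ascending order and paired with precisions.
    """
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    ap = 0.0
    for i in range(1, len(mrec)):
        ap += (mrec[i] - mrec[i - 1]) * mpre[i]
    return ap


# Made-up P-R points: area = 0.5 * 1.0 + 0.5 * 0.5 = 0.75.
ap_example = average_precision([0.5, 1.0], [1.0, 0.5])
```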

4.3. Results

Comparison of S³DR-Det with Existing Methods

We perform comparison experiments with existing rotating target detection methods, including two-stage and single-stage detection models, and show the results in Table 1. Our model achieves an AP of 89.68% on SSUTD and 90.19% on DNASI, which is better than all the two-stage and single-stage detection models in the table. It also obtains the highest recall on both datasets, which illustrates the advantage of the proposed algorithm in rotating target detection.

Table 1. Experimental results comparing different models.

Some representative detection results are shown in Figure 8. Although the targets are distributed in the images in arbitrary directions, the detection model still identifies them accurately and frames them according to their orientation, which effectively improves the model’s positioning and recognition accuracy. However, the detection frames of some targets are slightly shifted. Some shipwrecks have been sunk in the seabed for a long time and are partly buried, so the edge of the wreck contour is difficult to distinguish from the seabed background, which biases the positioning. In addition, observing conditions, instrumentation, and other factors degrade the imaging quality, and suspended matter in the water masks the effective echo of the target, making it difficult to identify the target accurately.

Figure 8. Display of some of the test results.

4.4. Ablation Study

4.4.1. FDM

To assess the efficacy of the two modules within the FDM, we performed experiments using various configurations of the detection head and present the outcomes in Table 2. Adding only the AFO module yields an AP of 87.24%, indicating that optimizing the anchor frame before classification and regression in the detection head is crucial for the final detection results. Adding only the DRM module yields an AP of 86.12%. Convolutional neural networks are not rotationally invariant; even though the backbone network extracts accurate, representative features, these features still vary with rotation, so introducing the DRE in the DRM adds orientation information and gives better classification and regression results. Using both modules together gives the best result, an AP of 89.68%.

Table 2. Comparative experimental results of different settings of FDM.

4.4.2. S-A

To determine the optimal weighting parameter settings, we designed a parameter selection experiment that explores how the parameters interact and identifies the best combination, with the results displayed in Table 3. IoU is a crucial metric for evaluating the overlap between the predicted and actual frames. In rotating target detection, the IoU term (weighted by α) deserves significant weight because it directly indicates the accuracy of the prediction frame. However, an IoU weight that is too high may make the model too strict, so that it struggles to learn some complex or deformed targets. The centroid distance (weighted by β) measures the difference between the center positions of the predicted frame and the real frame. In rotating target detection, the weight of the centroid distance can be lowered appropriately because, even if the centroid distance is large, a sample may still be a high-quality positive sample as long as the IoU and angle differences are favorable. However, a centroid distance weight that is too low may lead to insufficient learning of position information. The angular difference (weighted by γ) measures the difference in rotation angle between the predicted frame and the real frame. Since the accuracy of the rotation angle is crucial for target detection, the weight of the angular difference should be relatively high. However, an angular difference weight that is too high may make the model too sensitive to angular changes, which may affect the learning of the other factors. In Table 3, the best AP (89.68%) is obtained with α = 0.5, β = 0.2, and γ = 0.3.

Table 3. Experimental results of parameter optimization for S-A.
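The selection of the best weight combination can be reproduced from the values reported in Table 3. The sketch below only looks up the reported AP values; it does not retrain anything, and the variable names are illustrative. That every tested combination sums to 1 is an observation from the table, not a constraint stated in the text.

```python
# (alpha, beta, gamma) -> AP (%) as reported in Table 3.
table3_ap = {
    (0.4, 0.3, 0.3): 88.42,
    (0.4, 0.2, 0.4): 83.15,
    (0.4, 0.1, 0.5): 79.54,
    (0.5, 0.3, 0.2): 88.82,
    (0.5, 0.2, 0.3): 89.68,
    (0.5, 0.1, 0.4): 84.10,
    (0.6, 0.3, 0.1): 77.95,
    (0.6, 0.2, 0.2): 83.62,
    (0.6, 0.1, 0.3): 75.56,
}

# Pick the combination with the highest reported AP.
best_weights = max(table3_ap, key=table3_ap.get)
best_ap = table3_ap[best_weights]

# Check the observation that each tested combination sums to 1.
weights_sum_to_one = all(abs(sum(w) - 1.0) < 1e-9 for w in table3_ap)
```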

To verify the effect of DRC, FDM, and S-A designed in this paper, we perform ablation experiments using different combinations of modules to verify the effect they produce on the model, and the results are shown in Table 4.

Table 4. Module ablation experiments in the model.

We add the proposed DRC, FDM, and S-A to the baseline model in different combinations to validate the effect of each part. The results show that every combination of modules improves significantly on the baseline model. Compared with the baseline model using static convolution, the DRC module allows the convolution kernels to align dynamically with the target angle, which demonstrates the adaptability and effectiveness of the DRC module in capturing rotating targets. The FDM module effectively decouples the extracted rotational features and feeds the corresponding features into the respective branches of the detection head to achieve fine localization and classification, so adding the FDM module improves the detection accuracy and the AP value. Finally, S-A introduces an alignment degree that is better suited to rotated frames for measuring anchor frame quality; by combining the input IoU, the centroid distance, and the angle difference, it selects high-quality anchor frames more comprehensively and more realistically, which improves model accuracy. The full model, whose three modules are designed for the rotating target in the feature extraction phase, the target detection phase, and the model training phase, respectively, obtains an AP of 89.68%, an improvement of 8.62 percentage points over the baseline model, which supports the effectiveness of the proposed model.

5. Conclusions

In this paper, we propose the S³DR-Det model for detecting multi-directional, high aspect ratio shipwreck targets in side-scan sonar images. The proposed DRC, FDM, and S-A label assignment strategy address the inconsistency between the target and anchor frames, the inconsistency between classification features and regression features, and the inconsistency between rotating frame quality and existing label assignment strategies. The modules in the model are highly coupled both functionally and structurally, solving the problems posed by arbitrarily oriented high aspect ratio targets at different stages of the model. Experimental results show that S³DR-Det achieves the best detection results among the compared models on two side-scan sonar wreck datasets (SSUTD and DNASI) and effectively handles the detection of rotating targets with multiple directions. The S³DR-Det algorithm performs well in multi-directional and high aspect ratio shipwreck target detection tasks, with the advantages of high detection accuracy, dynamic rotational convolution, feature decoupling, and spatially aware label assignment. However, the algorithm also suffers from high computational complexity, high data quality requirements, and complex hyperparameter adjustment, and its generalization ability remains to be verified. Future work will focus on further optimizing the efficiency and generalization ability of the algorithm.

Author Contributions

Conceptualization, Q.M. and S.J.; methodology, Q.M.; validation, S.J., G.B., Y.C., G.L. and Q.M.; formal analysis, S.J. and G.L.; investigation, Q.M. and Y.W.; resources, S.J. and Y.C.; data curation, Q.M. and Y.W.; writing—original draft preparation, Q.M.; writing—review and editing, S.J.; visualization, Q.M.; supervision, S.J. and G.B.; project administration, S.J.; funding acquisition, S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 41876103.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Access to the data will be considered by the authors upon request. Part of the dataset is from https://github.com/huoguanying/SeabedObjects-Ship-and-Airplane-dataset (accessed on 5 June 2024) and https://github.com/freepoet/SCTD (accessed on 19 June 2024).

Acknowledgments

We would like to thank the editor and the anonymous reviewers for their valuable comments and suggestions, which greatly improved the quality of this paper. We thank our colleagues for their helpful suggestions during the experiment. Finally, we would like to thank the open-source MMRotate project.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Tang, Y.; Li, H.; Zhang, W.; Bian, S.; Zhai, G.; Liu, M.; Zhang, X. Light weight DETR-YOLO method for detecting shipwreck target in side-scan sonar. Syst. Eng. Electron. 2022, 44, 2427–2436.
Yu, Y.; Zhao, J.; Gong, Q.; Huang, C.; Zheng, G.; Ma, J. Real-Time Underwater Maritime Object Detection in Side-Scan Sonar Images Based on Transformer-YOLOv5. Remote Sens. 2021, 13, 3555.
Yulin, T.; Jin, S.; Bian, G.; Zhang, Y. Shipwreck Target Recognition in Side-Scan Sonar Images by Improved YOLOv3 Model Based on Transfer Learning. IEEE Access 2020, 8, 173450–173460.
Yulin, T.; Shaohua, J.; Gang, B.; Yonzhou, Z.; Fan, L. Wreckage Target Recognition in Side-scan Sonar Images Based on an Improved Faster R-CNN Model. In Proceedings of the 2020 International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE), Bangkok, Thailand, 30 October–1 November 2020; pp. 348–354.
Yulin, T.; Shaohua, J.; Fuming, X.; Gang, B.; Yonghou, Z. Recognition of Side-scan Sonar Shipwreck Image Using Convolutional Neural Network. In Proceedings of the 2020 2nd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Taiyuan, China, 23–25 October 2020; pp. 529–533.
Ma, Q.; Jin, S.; Bian, G.; Cui, Y. Multi-Scale Marine Object Detection in Side-Scan Sonar Images Based on BES-YOLO. Sensors 2024, 24, 4428.
Tang, Y.; Wang, L.; Jin, S.; Zhao, J.; Huang, C.; Yu, Y. AUV-Based Side-Scan Sonar Real-Time Method for Underwater-Target Detection. J. Mar. Sci. Eng. 2023, 11, 690.
Kong, W.; Hong, J.; Jia, M.; Yao, J.; Cong, W.; Hu, H. YOLOv3-DPFIN: A Dual-Path Feature Fusion Neural Network for Robust Real-Time Sonar Target Detection. IEEE Sens. J. 2020, 20, 3745–3756.
Zhao, J.; Li, J.; Li, M. Progress and Future Trend of Hydrographic Surveying and Charting. J. Geomat. 2009, 34, 25–27.
Wang, J.; Cao, J.; Lu, B.; He, B. Underwater Target Detection Project Equipment Application and Development Trend. China Water Transp. 2016, 11, 43–44.
Wang, X.; Wang, A.; Jiang, T. Review of application areas for side scan sonar image. Surv. Mapp. Bull. 2019, 1, 1–4.
Wang, J.; Zhou, J. Comprehensive Application of Side-scan Sonar and Multi-beam System in Shipwreck Survey. China Water Transp. 2010, 10, 35–37.
Lu, Z.; Zhu, T.; Zhou, H.; Zhang, L.; Jia, C. An Image Enhancement Method for Side-Scan Sonar Images Based on Multi-Stage Repairing Image Fusion. Electronics 2023, 12, 3553.
Xie, Y.; Bore, N.; Folkesson, J. Bathymetric Reconstruction From Sidescan Sonar With Deep Neural Networks. IEEE J. Ocean. Eng. 2023, 48, 372–383.
Neupane, D.; Seok, J. A Review on Deep Learning-Based Approaches for Automatic Sonar Target Recognition. Electronics 2020, 9, 1972.
Yang, D.; Wang, C.; Cheng, C.; Pan, G.; Zhang, F. Semantic Segmentation of Side-Scan Sonar Images with Few Samples. Electronics 2022, 11, 3002.
Hsu, W.-Y.; Lin, W.-Y. Ratio-and-Scale-Aware YOLO for Pedestrian Detection. IEEE Trans. Image Process. 2021, 30, 934–947.
Li, T.; Zhang, Z.; Zhu, M.; Cui, Z.; Wei, D. Combining transformer global and local feature extraction for object detection. Complex Intell. Syst. 2024, 10, 4897–4920.
Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight Underwater Object Detection Based on YOLO v4 and Multi-Scale Attentional Feature Fusion. Remote Sens. 2021, 13, 4706.
Dong, X.; Yan, S.; Duan, C. A lightweight vehicles detection network model based on YOLOv5. Eng. Appl. Artif. Intell. 2022, 113, 104914.
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016. ECCV 2016. Lecture Notes in Computer Science; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9905.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26–30 June 2016; pp. 779–788.
Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
Liu, Z.; Wang, H.; Weng, L.; Yang, Y. Ship Rotated Bounding Box Space for Ship Extraction From High-Resolution Optical Satellite Images With Complex Backgrounds. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1074–1078.
Zhang, Z.; Guo, W.; Zhu, S.; Yu, W. Toward Arbitrary-Oriented Ship Detection With Rotated Region Proposal and Discrimination Networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1745–1749.
Deng, L.; Gong, Y.; Lu, X.; Lin, Y.; Ma, Z.; Xie, M. STELA: A Real-Time Scene Text Detector With Learned Anchor. IEEE Access 2019, 7, 153400–153407.
Yang, X.; Fu, K.; Sun, H.; Yang, J. R2CNN++: Multi-Dimensional Attention Based Rotation Invariant Detector with Robust Anchor Strategy. arXiv 2018, arXiv:1811.07126.
Ding, J.; Xue, N.; Long, Y.; Xia, G.; Lu, Q. Learning RoI Transformer for Detecting Oriented Objects in Aerial Images. arXiv 2018, arXiv:1812.00155.
Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-Oriented Scene Text Detection via Rotation Proposals. IEEE Trans. Multimed. 2017, 20, 3111–3122.
Yang, X.; Zhang, G.; Yang, X.; Zhou, Y.; Wang, W.; Tang, J.; He, T.; Yan, J. Detecting Rotated Objects as Gaussian Distributions and its 3-D Generalization. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4335–4354.
Yang, X.; Zhou, Y.; Zhang, G.; Yang, J.; Wang, W.; Yan, J.; Zhang, X.; Tian, Q. The KFIoU Loss for Rotated Object Detection. arXiv 2022, arXiv:2201.12558.
Yi, Y.; Da, F. Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13354–13363.
Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A Context-Aware Detection Network for Objects in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024.
Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.-S.; Bai, X. Gliding Vertex on the Horizontal Bounding Box for Multi-Oriented Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459.
Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive Rotated Convolution for Rotated Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6566–6577.
Han, J.; Ding, J.; Xue, N.; Xia, G.-S. ReDet: A Rotation-equivariant Detector for Aerial Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2785–2794.
Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. arXiv 2019, arXiv:1908.05612.
Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic Refinement Network for Oriented and Densely Packed Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11204–11213.
Han, J.; Ding, J.; Li, J.; Xia, G.-S. Align Deep Features for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11.
Yang, X.; Yan, J. On the Arbitrary-Oriented Object Detection: Classification Based Approaches Revisited. Int. J. Comput. Vis. 2022, 130, 1340–1365.
Zhu, Y.; Du, J.; Wu, X. Adaptive Period Embedding for Representing Oriented Objects in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7247–7257.
Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9756–9765.
Hou, L.; Lu, K.; Xue, J.; Li, Y. Shape-Adaptive Selection and Measurement for Oriented Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 923–932.
Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic Anchor Learning for Arbitrary-Oriented Object Detection. arXiv 2020, arXiv:2012.04150.
Huang, Z.; Li, W.; Xia, X.-G.; Tao, R. A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection. IEEE Trans. Image Process. 2022, 31, 1895–1910.
Li, W.; Zhu, J. Oriented RepPoints for Aerial Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1819–1828.
Xiao, Z.; Wang, K.; Wan, Q.; Tan, X.; Xu, C.; Xia, F. A²S-Det: Efficiency Anchor Matching in Aerial Image Oriented Object Detection. Remote Sens. 2021, 13, 73.
Peng, C.; Jin, S.; Bian, G.; Cui, Y. SIGAN: A Multi-Scale Generative Adversarial Network for Underwater Sonar Image Super-Resolution. J. Mar. Sci. Eng. 2024, 12, 1057.

Figure 1. When using horizontal bounding boxes to detect directional targets, a lot of background information is often introduced, which interferes with the detection and makes the localization less accurate. The red box is the target labeling box.

Figure 2. Three types of inconsistencies brought by rotating multi-directional targets to the target detection task. (a). Sliding convolution in a fixed direction poses challenges in accurately capturing features of targets oriented in any direction. (b). Red areas are key features for regression and yellow areas are key features for classification. (c). The IoU-based approach to assigning labels makes it difficult to accurately assess the quality of the samples and can result in high-quality negative samples that contain key features (yellow boxes) and low-quality positive samples that do not contain key features (blue boxes).

Figure 3. General structure of the network.

Figure 4. DRC module schematic.

Figure 5. Schematic diagram of the principle of rotated convolution kernel, the example of which is a 3 × 3 convolution, where the 3 × 3 weight values are first expanded into a continuous two-dimensional space, followed by rotating the original convolution kernel by a certain angle to obtain a new rotated convolution kernel and then re-sampling the weight values from the rotated convolution kernel position to obtain the final rotated convolution kernel.

Figure 6. Schematic diagram of the adaptive feature decoupling detection head structure.

Table 1. Experimental results comparing different models.

Type	Model	SSUTD R (%)	SSUTD AP (%)	DNASI R (%)	DNASI AP (%)
Two-stage	RoI Transformer	91.83	89.19	90.85	87.64
Two-stage	Oriented R-CNN	91.23	88.76	90.50	85.82
Two-stage	Rotated Faster R-CNN	86.77	74.98	86.12	73.64
Two-stage	Gliding Vertex	75.68	63.12	78.74	64.28
Two-stage	CFA	86.12	73.64	87.33	75.87
One-stage	Rotated RetinaNet	81.53	76.21	83.58	76.28
One-stage	S²A-Net	88.74	81.06	87.52	79.63
One-stage	ATSS	83.1	76.7	83.37	75.92
One-stage	DRN	83.58	76.28	85.12	77.56
One-stage	R³Det	84.11	77.75	81.53	76.21
One-stage	R³Det-KFIoU	87.52	79.63	89.16	81.20
One-stage	S³DR-Det	92.70	89.68	93.98	90.19

Table 2. Comparative experimental results of different settings of FDM.

AFO	DRM	R (%)	AP (%)
√	-	90.12	87.24
-	√	88.98	86.12
√	√	92.70	89.68

Table 3. Experimental results of parameter optimization for S-A.

α	β	γ	AP (%)
0.4	0.3	0.3	88.42
0.4	0.2	0.4	83.15
0.4	0.1	0.5	79.54
0.5	0.3	0.2	88.82
0.5	0.2	0.3	89.68
0.5	0.1	0.4	84.10
0.6	0.3	0.1	77.95
0.6	0.2	0.2	83.62
0.6	0.1	0.3	75.56

Table 4. Module ablation experiments in the model.

DRC	FDM	S-A	R (%)	AP (%)
-	-	-	88.74	81.06
√	√	-	90.54	85.41
√	-	√	90.69	85.92
-	√	√	89.48	84.38
√	√	√	92.70	89.68

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
