Article

RCF-YOLOv8: A Multi-Scale Attention and Adaptive Feature Fusion Method for Object Detection in Forward-Looking Sonar Images

1 National Key Laboratory of Underwater Acoustic Technology, Harbin Engineering University, Harbin 150001, China
2 Key Laboratory of Marine Information Acquisition and Security, Ministry of Industry and Information Technology, Harbin Engineering University, Harbin 150001, China
3 College of Underwater Acoustic Engineering, Harbin Engineering University, Harbin 150001, China
4 School of Remote Sensing and Geomatics Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(19), 3288; https://doi.org/10.3390/rs17193288
Submission received: 24 July 2025 / Revised: 12 September 2025 / Accepted: 22 September 2025 / Published: 25 September 2025
(This article belongs to the Special Issue Efficient Object Detection Based on Remote Sensing Images)


Highlights

What are the main findings?
  • YOLO is highly effective in underwater object detection, and the RCF-YOLOv8 network, proposed for forward-looking sonar images, further enhances underwater object detection capabilities.
  • RCF-YOLOv8 significantly enhances spatial perception, feature representation, and fusion quality, thereby reducing false positives and missed detections in complex underwater environments.
What is the implication of the main finding?
  • This method provides a robust and efficient solution for underwater object detection, helping to enhance the practical application capabilities of underwater unmanned systems in target identification and operational tasks.
  • This research provides new insights into acoustic image analysis and highlights the importance of domain adaptation mechanisms in detection tasks.

Abstract

Acoustic imaging systems are essential for underwater target recognition and localization, but forward-looking sonar (FLS) imagery faces challenges due to seabed variability, resulting in low resolution, blurred images, and sparse targets. To address these issues, we introduce RCF-YOLOv8, an enhanced detection framework based on YOLOv8, designed to improve FLS image analysis. Key innovations include the use of CoordConv modules to better encode spatial information, improving feature extraction and reducing misdetection rates. Additionally, an efficient multi-scale attention (EMA) mechanism addresses sparse target distributions, optimizing feature fusion and improving the network’s ability to identify key areas. Lastly, the C2f module with high-quality feature fusion (C2f-Fusion) optimizes feature extraction from noisy backgrounds. RCF-YOLOv8 achieved a 98.8% mAP@50 and a 67.6% mAP@50-95 on the URPC2021 dataset, outperforming baseline models with a 2.4% increase in single-threshold accuracy and a 10.4% increase in multi-threshold precision, demonstrating its robustness for underwater detection.

1. Introduction

Object detection techniques are widely employed in remote sensing and geospatial analysis, covering many aspects from land use monitoring to natural disaster assessment. By analyzing remote sensing images, researchers can automatically identify and classify surface features such as urban expansion, crop types, and forest cover. Advances in modern technology have enabled the extension of object detection methodologies to sonar applications, particularly in underwater scenarios [1,2,3,4,5,6]. In ocean exploration, sonar systems obtain information on the underwater environment by emitting sound waves, and target detection technology can be applied to effectively identify seabed targets, such as shipwrecks, reefs, and marine life. This cross-field technological integration not only improves the understanding of marine resources but also strengthens marine environmental monitoring and regulation. In the domain of submarine target detection and marine environmental surveillance, sonar-based underwater object detection has emerged as a widely adopted and highly effective technique [7]. The real-time and intelligent recognition of submarine targets through advanced sonar image processing holds significant importance for both civilian and military applications. These applications include underwater rescue operations; wreckage identification; submarine recovery; and military operations, such as submarine tracking and mine detection [8].
Traditional sonar image analysis methodologies are constrained by their reliance on manually designed features, including pixel value analysis, grayscale thresholding, and the utilization of prior knowledge of target attributes [9]. While effective, these methods are restricted in their processing capabilities and often fail to achieve optimal detection accuracy. Additionally, conventional classification approaches rooted in machine learning paradigms demonstrate heightened vulnerability to noise interference, exhibiting pronounced performance degradation in acoustically challenging environments, thereby lacking operational robustness [10].
Yann LeCun et al. proposed the theory of deep learning, expounded the back-propagation algorithm, realized end-to-end learning, and promoted breakthroughs in multiple fields [11]. In the modern era, deep learning has emerged as a dominant force and has provided transformative solutions to the challenges of computer vision tasks. For target recognition, Deep Neural Networks (DNNs) enable hierarchical feature extraction, transitioning from low-level, task-specific features to high-level, abstract representations, which helps to identify key target features [12]. By leveraging deep learning in sonar image recognition, detection accuracy and robustness can be significantly improved, thereby minimizing reliance on manual feature engineering [13]. These advances not only promote fully automated and intelligent recognition systems but also lay the foundation for autonomous decision-making in underwater unmanned systems, such as autonomous underwater vehicles (AUVs). This integration supports real-time target detection and high-precision classification, promoting the development of next-generation underwater detection and monitoring technologies [14].
Object detection represents a well-established research domain in deep learning. For systematic analysis, object detection architectures are fundamentally divided into two distinct methodological approaches: region proposal- and regression-based frameworks. The former frameworks, exemplified by R-CNN [15], Fast R-CNN [16], Faster R-CNN [17], etc., have sequential pipeline architectures. These frameworks first generate candidate object regions through proposal mechanisms, and then classification and bounding box regression operations are performed in these regions. Conversely, the latter frameworks, represented by SSD [18], RetinaNet [19], YOLO [20], etc., eliminate the region proposal step, and they can simultaneously output both the object class and its bounding box location in a single forward pass. This architectural simplification typically results in substantially reduced inference latency, rendering single-stage detectors particularly advantageous for time-sensitive tasks demanding optimized computational performance.
YOLO, as the foundational work of the single-stage target detection paradigm, redefined the detection process through a fully convolutional end-to-end architecture [20]. The central principle at play here is the aim to change the nature of object detection from a traditional process to a regression task. The architecture executes concurrent spatial regression and categorical prediction within a unified forward propagation, generating both object localization parameters and category confidence scores in parallel. As a result, there is no longer a need to explicitly generate region proposals. Over time, the YOLO series has seen significant advancements in feature extraction, fusion, and predictions. YOLOv2 has been adapted for target recognition in high-resolution images [21]; YOLOv3 [22], YOLOv4 [23], YOLOv6 [24], YOLOv7 [25], and YOLOv8 [26,27] utilize updated backbone networks to enhance the extraction of deeper features. Furthermore, YOLOv3 [22], YOLOv4 [23], and YOLOv5 [28,29] emphasize feature fusion in the neck networks to improve feature diversity and robustness. Meanwhile, YOLOX [30,31], YOLOv6 [24], YOLOv7 [25], and YOLOv8 [26,27] effectively decouple classification and regression tasks.
Although the YOLO algorithm has been successfully applied to natural environments, it still faces significant challenges in sonar image target recognition. The marine environment is highly complex and variable, with numerous interfering elements, including marine environmental noise, ship self-noise, and seabed reverberation, which lead to blurred edges and a loss of detailed information in sonar image targets. Real-time underwater object recognition in sonar images is significantly hindered by these factors [32]. Across the globe, researchers have carried out in-depth investigations into the utilization of the YOLO algorithm for underwater target detection. For example, Li et al. compared the performance of YOLOv5 between underwater sonar datasets and public datasets and reported a lower detection accuracy on the former [33]. Similarly, Xie et al. applied YOLOX to sonar images, successfully identifying targets, although the detection accuracy remained limited [34]. These findings highlight the need for strategies to mitigate interference and address environmental challenges in order to optimize the detection efficacy of YOLO-based architectures in underwater target recognition scenarios.
In response to the unique demands of sonar-based detection, the YOLO framework has been adapted and optimized by researchers worldwide, demonstrating enhanced adaptability and robustness in underwater environments. Early research mainly focused on verifying feasibility. For example, Steiniger et al. compared the automatic target recognition performance of YOLOv2 and YOLOv3 in side-scan sonar images. The results showed that YOLOv3 performed better than YOLOv2 and verified the effectiveness of transfer learning with limited data [35]. However, their work mainly served as a comparison baseline and lacked in-depth architectural innovations to address inherent challenges in sonar, such as acoustic shadowing or speckle noise. Subsequent work shifted its focus to feature enhancement, a direction that also motivates the adaptive feature fusion explored in this study. Fan et al. improved the utilization of multi-scale features in YOLOv4 by introducing the ASFF module [36], while Li et al. used the DCSP chain backbone network to reduce redundant gradient information [2]. However, these methods mainly optimize the aggregation of features at each scale independently and do not fully address the challenge of modeling cross-scale contextual dependencies. The ASFF module performs weighted fusion but may still ignore the complex nonlinear interactions between features with large resolution differences. Similarly, while the DCSP chain effectively simplifies the information flow in the backbone network, its ability to explicitly capture and propagate semantically rich multi-scale contextual cues is still limited, which is particularly important for distinguishing low-contrast targets from noisy sonar backgrounds. Chen et al. innovatively integrated the Swin Transformer [37] into YOLOv5 [38], while Zheng et al. introduced the SPPFCSPC module to YOLOv5 to enhance its multi-scale feature extraction capabilities [6]. However, although the Swin Transformer can establish long-range dependencies, its window attention mechanism may cause semantic barriers between feature layers of different scales, making it difficult to achieve truly unified multi-scale semantic alignment. The SPPFCSPC module expands the receptive field through multi-scale pooling, but the pooling operation itself loses some spatial detail, so when features at extreme scales are fused, subtle but critical target contours are easily diluted or lost in the layered transmission.
The YOLO series of models have demonstrated significant practical value in underwater target detection tasks. Martin et al. developed a side-scan sonar harbor’s wall structure detection system based on YOLOX, with an mAP50 of 91.3% [39]. By fusing YOLOv8 and BoT-SORT tracking methods, Xing et al. developed a real-time fish school counting system that achieved 78.5% tracking accuracy [40]. For the optimization of specific scenarios, by implementing a multi-scale feature fusion strategy, Lin et al. achieved significant improvements in side-scan sonar shipwreck recognition accuracy [41]. Peng et al. introduced the DDPM diffusion model to enhance sonar image quality and improve shipwreck detection performance [42]. Qu et al. proposed CBAM-YOLOX, which improved the localization accuracy of small targets through a channel attention mechanism [43]. Zhang et al. incorporated dedicated detection layers in YOLOv7, increasing the mAP for small objects by 6.7% [44]. Yaoming Zhuang et al. proposed the UWNet model by combining the Structured Space Model (SSM) and the feature enhancement mechanism, which effectively solved the problem of balancing accuracy and efficiency in underwater target detection and provided a feasible technical solution for the deployment of underwater robot platforms [45]. These studies highlight the significant progress made in underwater object detection tasks. However, current research focuses on optical images, while research on sonar images, especially forward-looking sonar images, is relatively rare, and both data and models are even more scarce.
While recent studies have achieved notable advancements in enhancing detection accuracy and speed for underwater target detection, they have not fully harnessed the rich spatial information that is inherent in sonar images. Additionally, although advances in feature fusion have been achieved, the quality of sonar image fusion in the early stages has frequently been overlooked. To maximize the potential of sonar images for underwater target detection, further improvements in feature extraction and fusion techniques are essential while carefully considering the unique characteristics of sonar data.
While YOLOv8 demonstrates state-of-the-art performance in natural imagery, its application to sonar data remains underexplored, necessitating specialized research to address the unique challenges posed by acoustic signatures. In this study, we introduce relevant improvements and optimizations to enhance the YOLOv8 model. Specifically, we aim to reduce false detection and missed detection rates while enhancing the detection accuracy of seabed targets by utilizing spatial information from forward-looking sonar images. Our research emphasizes improving target feature expression and refining multi-scale feature fusion strategies tailored to sonar imagery. To tackle the challenges, we present RCF-YOLOv8, an advanced architecture that is optimized for forward-looking sonar data, which achieves higher precision while minimizing both false positives and negatives in forward-looking sonar datasets. The core innovation lies in adaptive feature aggregation mechanisms that mitigate the adverse effects of acoustic noise and spatial inhomogeneity, thereby improving localization accuracy in complex underwater environments. Our method uses forward-looking sonar (FLS) images as input, which often face challenges such as low resolution, blurred imaging, sparse target representation, and inadequate feature quality. These issues are often the result of spatial inhomogeneities in seafloor topography, substrate variations, seabed roughness, and the complex seabed acoustic field. As the quality of feature extraction heavily influences target detection in FLS images, our model leverages the spatial information that is inherent in these sonar images to enhance both feature extraction and the fusion processes. By addressing these limitations, the RCF-YOLOv8 model outperforms the baseline YOLOv8 model and achieves superior detection performance, which is critical for reliable underwater object localization in complex FLS operation scenarios. The primary innovations presented in this work can be summarized as follows:
  • CoordConv is used to solve the problem of the insufficient spatial perception of ordinary convolution. By explicitly embedding coordinate information, the network’s modeling ability of the target spatial position is directly improved.
  • The backbone network is combined with the EMA module for cross-space learning to solve the problem of feature incoherence that is caused by sparse target distribution in FLS images. The relevance of multi-scale features is enhanced through cross-channel and cross-space attention mechanisms, the model’s ability to represent sparse targets is improved, and a new connection strategy is introduced to reduce information loss.
  • The C2f-Fusion module is proposed to reduce the impact of FLS image blurring. By optimizing feature fusion, the fusion quality is improved and context information is captured more effectively.

2. Methods

YOLOv8 significantly enhances detection efficiency and real-time performance over its predecessors by adopting a fully single-pass network architecture for bounding box prediction. Key improvements include the introduction of anchor-free detection, which simplifies the detection process and reduces computational overhead, as well as an optimized non-maximum suppression mechanism that improves upon the YOLOv5 framework. These advancements enable YOLOv8 to achieve state-of-the-art performance in speed and accuracy across various vision tasks, including object recognition, image classification [46], and instance segmentation [47]. As forward-looking sonar images have low resolution, they are first improved through contrast enhancement and image denoising. Based on YOLOv8, we propose the RCF-YOLOv8 underwater target detection algorithm, specifically designed to boost the identification and categorization capacities of sonar images. As depicted in Figure 1, RCF-YOLOv8 is composed of three key components: a backbone network for hierarchical representation learning, a multi-scale neck for cross-layer feature aggregation, and a decoupled detection head for task-specific predictions. Specifically, convolutional and C2f-based processing operations are applied by the backbone module to input forward-looking sonar images, enabling effective feature extraction for subsequent analysis. Within the backbone network, CoordConv takes the place of Conv. By adding x and y coordinate channels, the convolution can detect the spatial awareness encoded in sonar images. This architectural improvement enables the network to effectively capture the geometric dependencies among features and their spatial locations, thereby reducing the loss of spatial information. Additionally, the EMA module based on cross-space learning is incorporated. By providing aggregation methods to facilitate inter-dimensional spatial feature fusion, feature aggregation is enriched, generating more recognizable features. This helps the model better learn target features and enhances its representation ability. The model focuses on target features rather than a large number of background areas, meeting the accuracy and efficiency requirements in forward-looking sonar-based object detection tasks. Simultaneously, a new connection method is designed to integrate cross-layer contextual information, allowing the network to use information more flexibly, thereby reducing information loss. In the subsequent stage, the backbone-derived features are passed to the neck module, which performs inter-scale information integration. The neck module consists of a PAN-FPN structure. Through bottom-up and top-down paths, features of different scales are fused to obtain richer semantic information and more accurate target representations. This enhancement significantly boosts the architecture’s ability to recognize objects across diverse size ranges. The C2f-Fusion module employs the multi-scale channel attention module to integrate features that have inconsistent semantics and scales. This improves the model’s initial fusion quality, fully perceives contextual information, and more effectively extracts target features from sonar images containing various types of noise and complex backgrounds. Finally, the decoupled detection head classifies and locates the target, outputting the sonar image target detection results.

2.1. CoordConv

The essence of the convolution layer in computer vision is to perform hierarchical feature abstraction via sliding window convolutions and to use these features for further processing and analysis. Forward-looking sonar images usually contain rich spatial information, but traditional convolution operations often fail to fully utilize this information, leading to reduced accuracy in target detection and recognition. To address this issue, we replace the traditional convolution layer with CoordConv and directly incorporate spatial coordinate information into the convolution operation, thereby improving localization accuracy in underwater sonar imagery. The overall structural design of the CoordConv mechanism is illustrated in Figure 2.
The CoordConv layer serves as an easy-to-understand extension of the standard convolution that adds specific functionality. The goal of CoordConv is to understand the location of pixels through filters, thereby establishing a mapping between Cartesian space and pixel space [48]. First, the CoordConv method involves augmenting the input layer with explicit spatial position information through two appended channels representing x and y coordinates and applying a final linear scaling to the x and y coordinate values to adjust their range to [−1, 1] [49,50]. It is important to note that CoordConv adds the coordinate information across all pixels to the input feature map, enabling the model to explicitly encode spatial coordinate information during feature extraction to enhance positional awareness. This enhances the perception of target location and improves detection accuracy. By repurposing the partial network capacity to model non-shift-invariant features, CoordConv achieves better training convergence despite reduced translation invariance. This flexibility improves the network’s ability to handle challenging target classification scenarios with varying spatial distributions in forward-looking sonar images, thereby improving classification accuracy. In summary, considering the special and distinguishable traits of sonar images, the introduction of the CoordConv layer helps capture the spatial location information of the target, allowing the model to demonstrate stronger generalization capabilities when dealing with targets in various environmental conditions.
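As an illustration, the minimal PyTorch sketch below shows how a CoordConv-style layer can be built by appending normalized x and y coordinate channels (scaled to [−1, 1]) before a standard convolution. The class name and parameter choices are ours for illustration and do not reproduce the exact layer configuration used in RCF-YOLOv8.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Minimal CoordConv sketch: append normalized x/y coordinate channels,
    then apply a standard convolution (hypothetical layer, for illustration)."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # Two extra input channels hold the x and y coordinate maps.
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size,
                              stride=stride, padding=padding)

    def forward(self, x):
        b, _, h, w = x.shape
        # Coordinate grids scaled to [-1, 1], as described above.
        ys = torch.linspace(-1.0, 1.0, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

# Usage: drop-in replacement for a Conv block in the backbone.
feat = torch.randn(2, 64, 80, 80)
out = CoordConv2d(64, 128)(feat)   # -> (2, 128, 80, 80)
```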

2.2. EMA Module

Forward-looking sonar images often contain sparse targets and suffer from low-quality feature representation, posing challenges for target recognition and classification models. Because targets may be unevenly distributed in the image and feature information is limited, traditional detection algorithms may struggle to capture key target features, increasing false alarm rates in cluttered underwater sonar imagery. Additionally, the spatial heterogeneity of seabed topography, texture, and roughness, along with the complex seabed sound field, often results in low resolution and blurred imaging in sonar images. The shape and texture of targets may resemble those of the surrounding environment, increasing the difficulty of recognition and classification for the model.
This study builds on block-based and modular deep learning strategies explored in previous research, including intra prediction of variable blocks, landslide mapping with a hybridized block modular model, overfitted block-based prediction, and image denoising using block matching and 3D filtering. Integrating such methods with attention mechanisms can yield models with enhanced performance. For example, the Squeeze-and-Excitation Network (SENet) [51] presented a lightweight and efficient model that emphasizes channel-wise relationships over spatial ones. The Convolutional Block Attention Module (CBAM) [52] sequentially combines channel and spatial attention but may fall short in capturing cross-dimensional interactions. Conversely, non-local neural networks [53] capture long-range dependencies by computing the correlation between all locations in the feature map; however, their high computational and memory requirements limit their use with high-resolution inputs.
To enhance feature learning, we introduce the EMA module into the YOLOv8 framework, as depicted in the architectural diagram in Figure 3. Unlike conventional attention modules, EMA not only exploits channel relationships to highlight informative features but also retains fine-grained spatial structures. This design ensures that the model focuses on discriminative target patterns rather than irrelevant background regions. Additionally, the EMA module optimizes feature extraction by establishing a parallel sub-network, which ensures that the accuracy and efficiency requirements for target detection in forward-looking sonar imagery are satisfied [54,55]. Finally, the EMA module enhances feature aggregation through the implementation of a cross-spatial information aggregation approach that spans various spatial dimensions, and more distinguishable features are generated. This mechanism facilitates discriminative target feature learning with higher efficiency, thereby augmenting the model’s representational capacity for complex underwater scenes.
For a sonar-derived feature map $X \in \mathbb{R}^{C \times H \times W}$, the EMA module is employed to partition the feature map $X$ into $g$ subgroups along the channel axis. As a result, it can capture diverse semantic information. The mathematical logic behind this grouping strategy is presented as
$X = [X_0, X_1, \ldots, X_{g-1}], \quad X_i \in \mathbb{R}^{C//g \times H \times W}, \quad g \ll C$ (1)
Then, three parallel subnetworks are utilized in the EMA module to derive attention weights for grouped features, enabling spatial attention modeling. Among these, two subnetworks implement 1D global average pooling to capture the global context across both horizontal and vertical dimensions, resulting in directional encoded features. Subsequently, the extracted features are concatenated along the height axis and processed through a 1 × 1 convolution to preserve dimensionality. The features generated by the convolutional operation are bifurcated into two vectors and subsequently processed through a Sigmoid activation function to model a bivariate binomial distribution characterized by the linear convolution outputs. The channel attention representations across different groups are then integrated through element-wise multiplication, producing inter-channel correlation features. The intermediate feature map $X_1$ can be mathematically expressed as
$X_1 = \sigma_W\left(F_1\left[\mathrm{GAP}(X)^H, \mathrm{GAP}(X)^W\right]\right) \cdot \sigma_H\left(F_1\left[\mathrm{GAP}(X)^H, \mathrm{GAP}(X)^W\right]\right),$ (2)
where $\mathrm{GAP}$ denotes the global average pooling operation, while $\sigma_W$ and $\sigma_H$ signify the nonlinear activation functions applied across the width and height dimensions, respectively. The term $F_1$ represents the $1 \times 1$ convolution operation, $[\cdot,\cdot]$ denotes the concatenation process, and $\cdot$ represents multiplication aggregation. The third network uses a single $3 \times 3$ convolution to capture the multi-scale features $X_2$.
Next, the global spatial context captured by the $1 \times 1$ convolution path is compressed using 2D GAP. The feature representation obtained by the $1 \times 1$ path is then element-wise multiplied with the feature matrix obtained from the $3 \times 3$ path through the softmax activation function to ensure dimension compatibility, resulting in the initial spatial attention feature map. At the same time, 2D GAP is used to extract comprehensive spatial statistics from the output of the $3 \times 3$ convolution path. The pooled representation is then matrix multiplied with the features obtained from the $1 \times 1$ branch through the softmax activation function to achieve dimension consistency, resulting in the second spatial attention feature map $X_4$. These operations can be mathematically formulated as
$X_2 = F_3(X),$ (3)
$X_3 = \mathrm{GAP}(X_1) \times S(X_2),$ (4)
$X_4 = S(X_1) \times \mathrm{GAP}(X_2),$ (5)
where $F_3$ represents a $3 \times 3$ convolution. The enhanced EMA architecture utilizes three concurrent processing streams to derive attention representations from feature maps, capturing cross-channel relationships to encode spatial information while maintaining spatial integrity. Here, $S(\cdot)$ denotes the softmax normalization function, and the symbol $\times$ signifies the element-wise tensor product operation.
Finally, the spatial attention weights generated by the two paths are fused using an element-by-element weighted summation strategy, and the mapping is reconstructed for each feature subgroup. After applying a Sigmoid function, the EMA feature map $X_{output}$, which has the same size as $X$, is output:
$X_{output} = \sigma(X_3 \cdot X_4)$ (6)
As mentioned above, the EMA module adopts a multi-scale context-aware approach to elevate the localization accuracy of the feature space while establishing cross-layer feature associations, effectively optimizing the representation robustness of key target features. Deploying the EMA module in forward-looking sonar imaging systems can significantly enhance target detection capabilities. Additionally, inspired by the residual structure, a new connection method is designed to learn features from various layers. The network can use information from different layers more flexibly and enhance the interactive optimization of cross-layer features. The design of the new connection method can better train deep models and obtain the final feature $X_{final}$. The connection strategy is illustrated in Figure 4 and can be expressed as
$X_{final} = [X, X_{output}]$ (7)
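For concreteness, the following simplified PyTorch sketch follows the spirit of Equations (1)–(7): channel grouping, directional pooling, parallel 1 × 1 and 3 × 3 branches, cross-spatial aggregation, and the concatenation-based connection. The exact kernel sizes, the use of GroupNorm, the einsum-based aggregation, and the summation inside the final Sigmoid are our assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class EMASketch(nn.Module):
    """Simplified sketch of EMA-style cross-spatial attention (Eqs. (1)-(7));
    several design details are assumptions made for illustration."""

    def __init__(self, channels, groups=8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        c = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # 1D GAP along the width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # 1D GAP along the height
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(c, c)

    def forward(self, x):
        b, C, h, w = x.shape
        g = self.groups
        xg = x.reshape(b * g, C // g, h, w)                      # Eq. (1): channel grouping
        # Directional context on the 1x1 path (Eq. (2))
        xh = self.pool_h(xg)                                     # (b*g, c, h, 1)
        xw = self.pool_w(xg).permute(0, 1, 3, 2)                 # (b*g, c, w, 1)
        y = self.conv1x1(torch.cat([xh, xw], dim=2))             # concat along the height axis
        yh, yw = torch.split(y, [h, w], dim=2)
        x1 = self.gn(xg * yh.sigmoid() * yw.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(xg)                                    # 3x3 path (Eq. (3))
        # Cross-spatial aggregation (Eqs. (4)-(5)): pooled vectors re-weight the other path
        a1 = torch.softmax(x1.mean(dim=(2, 3)), dim=1)           # 2D GAP of the 1x1 path
        a2 = torch.softmax(x2.mean(dim=(2, 3)), dim=1)           # 2D GAP of the 3x3 path
        x3 = torch.einsum('bc,bchw->bhw', a1, x2)
        x4 = torch.einsum('bc,bchw->bhw', a2, x1)
        attn = torch.sigmoid(x3 + x4).unsqueeze(1)               # Eq. (6): one map per group
        out = (xg * attn).reshape(b, C, h, w)
        return torch.cat([x, out], dim=1)                        # Eq. (7): [X, X_output]

# Usage: the concatenated output has 2*C channels, so the following layer must expect that.
feat = torch.randn(2, 64, 40, 40)
print(EMASketch(64, groups=8)(feat).shape)   # torch.Size([2, 128, 40, 40])
```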

2.3. C2f-Fusion Module

Feature fusion is a core concept in deep learning, especially in object detection. It strategically combines feature maps from different layers, scales, or branches within a network to generate more robust and information-rich feature representations. Classic feature fusion has laid an important foundation for modern deep learning models. Concatenation directly concatenates multiple feature maps along the channel dimension [56]. This fully preserves all feature information and provides the richest information. However, it increases the number of channels, significantly enlarging the computational overhead and parameter count of subsequent convolutional layers. Element-wise sum/average [57] directly adds or averages the pixel values at corresponding locations, maintaining the size and number of channels of the feature maps. This approach is computationally effective and efficient. However, it requires the identical sizes and channels for the feature maps to be fused. This approach is inherently content-independent, treating all feature regions equally and assuming that all locations and channels contribute equally. This is a significant drawback in sonar images with high noise and complex backgrounds. The feature pyramid network (FPN) [58] upsamples high-level, strong semantic features and then fuses them with low-level, high-resolution features from the backbone layer through element-wise addition. This pioneering approach addresses the multi-scale problem in object detection, resulting in a clear structure. However, the fusion operation remains a simple addition, primarily conveying semantic information. This dilutes the shallow, precise positioning information as it is passed upward. The Path Aggregation Network (PANet) [59], after FPN fusion, then fuses the underlying features upward through methods such as down sampling, strengthening the positioning information across the entire feature pyramid. This enhances both semantic and positioning information, resulting in superior results compared to the FPN. However, the fusion operation, typically addition or concatenation, remains fixed and static, lacking adaptive adjustment based on the input features.
Due to the complex seabed environment, forward-looking sonar images often have a low resolution and are fuzzy. The shape and texture of the targets may be similar to those of the surrounding environment, resulting in unclear feature expression in previous sonar images. The inherent ambiguity in underwater imagery poses substantial obstacles for conventional detection systems in achieving precise target identification and categorization. To overcome the complexities of marine environments, sophisticated feature integration techniques are crucial for improving model resilience and recognition precision. In the YOLOv8 framework, the C2f module incorporates multiple bottleneck blocks, each consisting of a pair of convolutional layers designed to transform the input feature maps into hierarchical representations. The fundamental role of the C2f architecture lies in optimizing model performance and classification accuracy through hierarchical feature learning. By integrating residual bottleneck blocks, the module enables a more efficient extraction of multi-scale contextual information. However, feature fusion in C2f is achieved through simple splicing, which may not yield optimal results. The iAFF iterative attention feature fusion module, illustrated in Figure 5, introduces the multi-scale channel attention module (MS-CAM) to more effectively fuse semantically and scale-inconsistent features. This architecture mitigates the bottleneck caused by suboptimal initial feature integration, ensuring high-quality initial representations for downstream refinement [60,61,62].
The input feature map $X \in \mathbb{R}^{C \times H \times W}$ is defined. This feature map can be used to calculate the channel attention weight $F_L(X)$ for local features and $F_G(X)$ for global features through point-by-point convolution in two branches. The weights $F_L(X)$ and $F_G(X)$ are then fused to obtain the multi-scale channel attention weight $M(X)$, which can be expressed as
$F_L(X) = B\!\left(f_2\!\left(\delta\!\left(B\!\left(f_1(X)\right)\right)\right)\right),$ (8)
$F_G(X) = B\!\left(f_2\!\left(\delta\!\left(B\!\left(f_1(\mathrm{GAP}(X))\right)\right)\right)\right),$ (9)
$M(X) = \sigma\!\left(F_L(X) \oplus F_G(X)\right),$ (10)
where the $1 \times 1$ convolutions $f_1$ and $f_2$ serve to reduce and restore the channel dimensions of $X$, respectively: $f_1$ downsamples $X$ to $C/r$ channels, while $f_2$ reconstructs the original channel count, with the reduction factor $r$ balancing computational cost and feature richness. $B$ refers to batch normalization, $\delta$ is the ReLU activation, $\mathrm{GAP}$ signifies the global average pooling operation, and $\oplus$ denotes broadcasting addition.
Next, the input features $X$ and $Y$ undergo initial feature integration, followed by the application of MS-CAM on the integrated features to generate the attention weights $M(X \uplus Y)$. Then, features $X$ and $Y$ are fused using the corresponding weights to obtain the output feature. This intermediate result serves as the input to the next fusion stage, culminating in the final fused feature formulated by
$H = M(X \uplus Y) \otimes X + \left(1 - M(X \uplus Y)\right) \otimes Y,$ (11)
$Z = M(H) \otimes X + \left(1 - M(H)\right) \otimes Y,$ (12)
where $\uplus$ represents the initial feature integration, and $\otimes$ represents the multiplication of corresponding elements. $H$ represents the output feature from the initial feature fusion, while $Z$ represents the result of the final fusion operation.
The iAFF structure introduced in C2f-Fusion can extract global and local features, enabling a comprehensive contextual understanding and improving the initial feature fusion quality. Figure 6 presents the proposed approach. The input $X \in \mathbb{R}^{C \times H \times W}$ first passes through a $1 \times 1$ convolution, generating $X_1$ with doubled channels. The Split operation is then applied to divide $X_1$ into two branches: $X_2$ and $X_3$. Then, $X_3$ passes through the bottleneck module to generate the output features $X_4$ and $X_5$. Feature $X_5$ is processed again by the bottleneck module to produce the output features $X_6$ and $X_7$. Following this pattern, $X_{2n+1}$ passes through the bottleneck module to produce the output features $X_{2(n+1)}$ and $X_{2n+3}$, which can be expressed as
$X_1 = F_1(X),$ (13)
$X_2, X_3 = \mathrm{Split}(X_1),$ (14)
$X_4, X_5 = \mathrm{Split}(\mathrm{Bottleneck}(X_3)),$ (15)
$X_{2(n+1)}, X_{2n+3} = \mathrm{Split}(\mathrm{Bottleneck}(X_{2n+1})), \quad n = 1, 2, 3, 4,$ (16)
where $F_1$ represents a $1 \times 1$ convolution, Split refers to an operation that distributes the output of the module to different subsequent layers for varied calculations, and Bottleneck refers to an operation designed to enhance the learning of complex features. After the input feature $X$ experiences a $1 \times 1$ convolution, it then enters a Split operation. Finally, it is processed by multiple bottleneck modules, and the feature maps $X_2, X_3, X_4, X_6, \ldots, X_{2(n+1)}, X_{2n+3}$ are obtained.
Then, the feature maps from the $n + 2$ branches are fused using the iAFF mechanism to generate $X_{output}$, which is then transformed via a $1 \times 1$ convolution to produce $X_{final}$. The mathematical expression for this process is
$X' = X_2 \uplus X_3 \uplus \cdots \uplus X_{2n+3},$ (17)
$X'' = M_2(X') \otimes X_2 + M_3(X') \otimes X_3 + \cdots + M_{2n+3}(X') \otimes X_{2n+3},$ (18)
$X_{output} = M_2(X'') \otimes X_2 + M_3(X'') \otimes X_3 + \cdots + M_{2n+3}(X'') \otimes X_{2n+3},$ (19)
$X_{final} = F_1(X_{output}),$ (20)
where $X'$ represents the initial feature integration of the $X_i$, $M_i(X')$ represents the MS-CAM weight of the input feature $X'$, and $M_2(X') + M_3(X') + \cdots + M_{2n+3}(X') = 1$. $X''$ represents the result after the initial feature fusion in the iAFF module, $M_i(X'')$ represents the MS-CAM weight of the input feature $X''$, and $M_2(X'') + M_3(X'') + \cdots + M_{2n+3}(X'') = 1$. The index $i$ takes the values $2, 3, 4, \ldots, 2(n+1), 2n+3$. The symbol $X_{output}$ refers to the feature map generated through sequential fusion iterations, whereas $X_{final}$ represents the output of the C2f-Fusion module after the final processing.
In summary, by integrating the iterative attentional feature fusion (iAFF) module, the C2f-Fusion architecture simultaneously captures global and local features, enhances contextual awareness, and improves the initial feature integration quality. This results in more discriminative target representations for forward-looking sonar imagery.
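The PyTorch sketch below illustrates the MS-CAM weighting and the two-step iAFF fusion of a pair of feature maps (Equations (8)–(12)). The reduction ratio r, the use of element-wise summation for the initial integration ⊎, and the module names are assumptions made for illustration; the multi-branch fusion inside C2f-Fusion (Equations (17)–(20)) would extend the same idea to n + 2 inputs.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention M(X) from Eqs. (8)-(10); r is an assumed reduction ratio."""
    def __init__(self, channels, r=4):
        super().__init__()
        mid = max(channels // r, 1)
        def branch():
            return nn.Sequential(
                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))
        self.local_att = branch()              # F_L: point-wise convs on the full map
        self.global_att = branch()             # F_G: the same convs on the pooled map
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        # Broadcasting addition of the local and global branches, then a sigmoid (Eq. (10)).
        return torch.sigmoid(self.local_att(x) + self.global_att(self.gap(x)))

class IAFF(nn.Module):
    """Two-step iterative attentional feature fusion of a pair of maps (Eqs. (11)-(12))."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.att1 = MSCAM(channels, r)
        self.att2 = MSCAM(channels, r)

    def forward(self, x, y):
        w1 = self.att1(x + y)                  # initial integration assumed to be a sum
        h = w1 * x + (1 - w1) * y              # Eq. (11)
        w2 = self.att2(h)
        return w2 * x + (1 - w2) * y           # Eq. (12)

# Usage: fuse two same-shape feature maps from different branches of C2f-Fusion.
a, b = torch.randn(2, 128, 40, 40), torch.randn(2, 128, 40, 40)
print(IAFF(128)(a, b).shape)   # torch.Size([2, 128, 40, 40])
```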

3. Experimental Results and Analysis

3.1. Sonar Image Dataset

Sonar image detection technology can be widely used in many fields such as industry, environment, and military, and has significant value [63]. Adopted in our study, the URPC2021 sonar dataset serves as the authorized data source for the 2021 National Underwater Acoustic Detection Competition. Released by Pengcheng Laboratory, it represents the most comprehensive and largest-scale acoustic imagery dataset to date. The dataset consists of 4000 high-resolution images acquired via a forward-looking sonar system, which simultaneously captures target distance, azimuth, height, and intensity information. These images include eight distinct object classes: human body, ball, circle cage, square cage, tyre, metal bucket, cube, and cylinder. The specific target distributions are presented in Figure 7. To facilitate model training and performance assessment, the dataset was systematically partitioned into three distinct components, namely, a training set, a validation set, and a test set, with a sample distribution ratio of 8:1:1 [64].
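A minimal sketch of the 8:1:1 partition described above is shown below; it assumes the images sit in a single directory and are split by shuffled index, and the paths and random seed are illustrative rather than taken from the original experimental pipeline.

```python
import random
from pathlib import Path

def split_dataset(image_dir, seed=0):
    """Partition image files into train/val/test with an 8:1:1 ratio (illustrative paths)."""
    files = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(files)
    n = len(files)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }

splits = split_dataset("URPC2021/images")   # hypothetical directory layout
print({k: len(v) for k, v in splits.items()})
```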

3.2. Experimental Setup and Model Training

The hardware for this experiment consisted of an RTX 3090 Ti graphics card with 24 GB of VRAM and an Intel (R) Core (TM) i9-10980XE CPU with a base frequency of 3.00 GHz. The software environment was Windows 10, and the deep learning framework was PyTorch. The specific version details are as follows: Python 3.8, PyTorch 1.10.0, and CUDA 11.3. For model training, the batch size was set to 64, and the input image size was 640 × 640. The optimizer employed stochastic gradient descent (SGD), with a weight decay of 5 × 10−4, an initial learning rate of 0.01, and a total training duration of 150 epochs. The data augmentation technique utilized was the Mosaic method.
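For reference, a training run with these hyperparameters could be launched through the Ultralytics YOLOv8 Python API roughly as follows. The model and data YAML paths are placeholders, and the RCF-YOLOv8 model definition is not part of the public package, so this is a sketch of the baseline configuration rather than the authors' training script.

```python
from ultralytics import YOLO

# Baseline YOLOv8s run with the hyperparameters reported above
# (SGD, lr0=0.01, weight decay 5e-4, 150 epochs, batch 64, 640x640 input;
# Mosaic augmentation is enabled by default in the Ultralytics trainer).
model = YOLO("yolov8s.yaml")          # placeholder; RCF-YOLOv8 would use a custom model YAML
model.train(
    data="urpc2021_sonar.yaml",       # hypothetical dataset config
    epochs=150,
    batch=64,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    weight_decay=5e-4,
)
```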

3.3. Performance Evaluation Indicators

The selection of suitable evaluation indicators is fundamental to quantifying model performance in FLS image target detection. This work uses a diverse range of metrics to comprehensively evaluate the proposed framework. Params: the total number of trainable parameters, a measure of model complexity. mAP: a unified accuracy measure across all categories. IoU: the degree of overlap between the predicted box and the true box, calculated as the ratio of their intersection area to their union area. mAP50: the mean average precision at an IoU threshold of 0.5. mAP50-95: the mean average precision averaged over 10 discrete IoU thresholds (0.5 to 0.95, with a step size of 0.05). FPS: the number of frames processed per second, reflecting real-time inference speed. The mathematical definitions are provided in Equations (21)–(26):
$IoU = \dfrac{\text{Area of Overlap}}{\text{Area of Union}},$ (21)
$mAP = \dfrac{1}{n} \sum_{i=1}^{n} AP_i,$ (22)
$AP = \int_{0}^{1} P(r)\, dr,$ (23)
$P = \dfrac{TP}{TP + FP},$ (24)
$R = \dfrac{TP}{TP + FN},$ (25)
$FPS = \dfrac{1}{T},$ (26)
where $AP_i$ represents the detection accuracy of the $i$-th type of target, which is calculated using precision $P$ and recall $R$. $TP$ represents the number of actual positive samples correctly detected by the model as positive examples, reflecting the accuracy of target recognition. $FP$ represents the number of actual negative samples that the model incorrectly labeled as positive examples, quantifying the false alarm rate. $FN$ represents the number of actual positive samples that the model failed to detect, reflecting the degree of missed detection. $T$ indicates the inference time per input sample (in seconds), and $n$ represents the number of categories to be detected.
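As a small worked example of Equations (21), (24), and (25), the sketch below computes the IoU of two axis-aligned boxes and precision/recall from raw counts; the box coordinates and counts are made-up numbers used purely for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2), per Eq. (21)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Example: predicted vs. ground-truth box (made-up coordinates)
print(iou((10, 10, 50, 60), (20, 15, 55, 70)))   # ~0.52

# Precision / recall from illustrative counts
tp, fp, fn = 80, 5, 15
precision = tp / (tp + fp)   # Eq. (24) -> 0.941...
recall = tp / (tp + fn)      # Eq. (25) -> 0.842...
print(precision, recall)
```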

3.4. Experimental Results

Multiple ablation studies and comparison experiments were conducted to evaluate how the introduced improvements affect model performance, thereby evaluating the efficacy of our method. We used the URPC2021 sonar dataset for these two sets of experiments. Before the experiments, we first performed image preprocessing on the original dataset to enhance the image quality. This process included contrast enhancement and image denoising. Histogram equalization was employed to redistribute the grayscale intensity values, thereby expanding the dynamic range and enhancing image contrast. On the basis of contrast enhancement, we introduced the Block Matching 3D (BM3D) [65] algorithm for sonar image denoising, and we constructed a three-dimensional group by finding similar image blocks through block matching. The three-dimensional transform domain threshold shrinkage was used to suppress noise, and joint denoising was achieved by combining collaborative filtering. The denoising effect was optimized through two iterations, effectively suppressing Gaussian noise while retaining image details. In the ablation experiments, we gradually replaced the key components to observe their specific influence on classification efficacy. Simultaneously, in the comparative experiments, we comprehensively compared our improvement strategy with existing mainstream methods. Through comprehensive comparative experiments conducted on a standardized dataset under controlled conditions, it was found that our methodology demonstrated superior performance in both precision and stability. The empirical findings are systematically presented through a dual approach encompassing statistical metrics and visual interpretation, thereby substantiating the effectiveness of our enhanced framework while offering innovative perspectives for the research community.

3.4.1. Image Denoising Strategy

The primary noise type in forward-looking sonar images is speckle noise, a multiplicative noise whose formation is closely related to the physical mechanisms of imaging. This noise originates from the interference of coherent waves during the imaging process and manifests as a fine, granular texture in the image, leading to image quality degradation. Its statistical characteristics exhibit high local correlation, rather than white noise, and its variance is correlated with signal intensity. Spatially, it exhibits a distinct granular texture structure that is highly correlated with local image content. The BM3D algorithm is applicable to various noise types and performs well in processing speckle noise. It can effectively suppress varying noise levels by adaptively adjusting filter parameters. Denoising effectiveness is typically evaluated by combining objective metrics (such as the signal-to-noise ratio (SNR) and structural similarity (SSIM)) with subjective expert visual evaluation.
The BM3D algorithm used in this paper exploits the inherent self-similarity of the image to perform denoising. Its core principle involves identifying image patches that are similar to a given reference patch, grouping these similar patches into a 3D array, applying collaborative filtering within this 3D group, and then aggregating the results back into the original image domain. While conceptually related to the non-local means (NLM) algorithm in its use of similar patches for denoising, BM3D employs a significantly more complex processing framework. The algorithm operates in two sequential stages:
Stage 1: Hard-Thresholding (Basic Estimation)
This stage processes the noisy image on a patch-by-patch basis. For each reference patch, it performs the following steps:
  • Grouping: Patches similar to the current reference patch are identified within the noisy image. These matching patches are stacked together to form a 3D group.
  • Collaborative Hard-Thresholding: The formed 3D group undergoes a 3D transform. A hard-thresholding operation is applied to the transform coefficients to attenuate noise. The thresholded coefficients are then inversely transformed, yielding denoised estimates for each patch within the group. These estimates are returned to their original spatial locations.
  • Aggregation: All overlapping denoised patch estimates obtained throughout the image are combined using a weighted average. This aggregation produces the basic estimate of the underlying clean image.
Stage 2: Wiener Filtering (Final Estimation)
Utilizing the basic estimate obtained in Stage 1, this stage performs another round of patch-wise processing:
  • Grouping: For each reference patch, block-matching is performed on the basic estimate image to locate similar patches. Based on these locations, corresponding patches are extracted from both the original noisy image and the basic estimate image, forming two separate 3D groups.
  • Collaborative Wiener Filtering: Both 3D groups undergo a 3D transform. The energy spectrum of the transformed basic estimate group is used as an approximation of the true signal spectrum. A Wiener filter is then applied to the transform coefficients of the noisy image group. The filtered coefficients are inversely transformed, producing refined estimates for each patch. These estimates are returned to their original locations.
  • Aggregation: All refined overlapping patch estimates are combined via a weighted average. This final aggregation yields the algorithm's ultimate estimate of the underlying clean image.
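The preprocessing chain described above (histogram equalization followed by BM3D denoising) could be sketched as below, assuming an 8-bit grayscale sonar image, OpenCV for equalization, and the third-party `bm3d` PyPI package for the two-stage filter. The noise standard deviation `sigma_psd` and the second pass are illustrative choices, not the authors' exact settings.

```python
import cv2
import numpy as np
import bm3d  # third-party PyPI package implementing the two-stage BM3D filter

def preprocess_sonar(path, sigma_psd=0.08, passes=2):
    """Contrast enhancement + BM3D denoising sketch for an 8-bit grayscale sonar image."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.equalizeHist(img)                      # redistribute grayscale intensities
    x = img.astype(np.float32) / 255.0               # bm3d expects a float image in [0, 1]
    for _ in range(passes):                          # two iterations, as described above
        x = bm3d.bm3d(x, sigma_psd=sigma_psd)        # hard-thresholding + Wiener stages
    return (np.clip(x, 0.0, 1.0) * 255.0).astype(np.uint8)
```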

3.4.2. Ablation Experiment

This section details the extensive ablation experiments conducted to methodically assess the performance differences between the improved network configuration and its original version for sonar image recognition applications. The detailed evaluation metrics obtained from this systematic comparison are numerically summarized in Table 1.
Experiment 1 represents the target detection test using the original YOLOv8 model. In contrast, Experiments 2–8 introduce incremental modifications to the model parameters and complexity. Experiments 2–4 involve the integration of CoordConv, the EMA attention mechanism, and the C2f-Fusion module into the YOLOv8 model, respectively. The experimental results show that mAP50 increases by 2.4%, 2.2%, and 2.1%, respectively, and mAP50-95 increases by 9.8%, 9.9%, and 8.5%, respectively. Experiments 5–7 involve the simultaneous integration of multiple modified modules. The results demonstrate that combining modules can enhance mAP50, mAP50-95, or both. Compared to Experiment 1, Experiment 8 shows an increase of 2.4% in mAP50 and 10.4% in mAP50-95. The integration of CoordConv in Experiments 2, 6, 7, and 8 results in significant model detection speed reductions. When compared with the baseline YOLOv8 architecture, the newly developed RCF-YOLOv8 framework presented in this study exhibits significantly improved classification performance.
We also compared the model performance of YOLOv8 and RCF-YOLOv8 in terms of the F1-score, recall, and confusion matrix. Figure 8 shows the F1-score curves of Experiments 1–8. The F1-score is calculated as $F1 = \frac{2 \times P \times R}{P + R}$ [38]. The F1 metric serves as a balanced measure combining precision and recall, quantifying model performance on a normalized scale from 0 to 1. Higher values indicate stronger model robustness and accuracy. The curve for Experiment 8 stays closer to 1 and remains in the higher F1-score range for longer, indicating better model performance. Figure 9 shows the recall curves of Experiments 1–8. The recall rate quantifies a classifier's capacity to correctly identify target class instances [38]. A higher recall rate indicates a stronger detection capability for positive samples. In comparison with Experiment 1, Experiments 2–8 exhibit significant improvements and stability in recall rate performance, with Experiment 8 showing the highest recall rate score. Figure 10 shows the confusion matrices for Experiments 1–8. Confusion matrices serve to visualize classifier performance [66]. The diagonal entries of this matrix denote the number of accurately classified samples, whereas the non-diagonal entries correspond to misclassifications. The number of correctly predicted samples in Experiments 2–8 is significantly higher than in Experiment 1. The number of correctly predicted samples in Experiments 1–8 is 1394, 2578, 2582, 2579, 2578, 2566, 2578, and 2585, respectively. Considering both correctly and incorrectly predicted samples across the eight categories, Experiment 8 shows the best performance. Collectively, the proposed RCF-YOLOv8 architecture demonstrates superior detection capabilities compared to the base network in underwater object recognition applications.
The role of feature extraction in the convolution layers is crucial. In deep learning, feature extraction usually refers to the process of automatically learning and extracting useful information from raw data. For target detection tasks, it can be divided into shallow and deep feature extraction. Shallow features usually correspond to low-level image characteristics, such as edges and corners, which are very important for distinguishing basic shapes; they have a high spatial resolution but may lack the ability to describe complex patterns. Deep features emerge as the network deepens: the convolution layers extract more abstract and complex features, such as object parts and textures, which help to identify higher-level semantic information and are crucial for the final target detection and classification. By stacking multiple convolutional layers, RCF-YOLOv8 learns this low-level to high-level feature hierarchy from the input image data, thereby improving its ability to detect targets.
We visualize the channel feature maps of certain layers in the YOLOv8 model and RCF-YOLOv8, as shown in Figure 11. Channel visualization helps us understand the features learned by the model at a specific level. Visualizing the activation of each channel reveals how the network processes input data and the key features that it emphasizes.
As demonstrated in Figure 11, conducting sensitivity and predictability analysis using the optimal model’s weight matrix enables the validation of whether the proposed mechanism in this study focuses on sonar-relevant features such as object contours and shadow zones [67]. Such a strategy can also reveal potential biases like overreliance on specific scales or channels. This insight supports both targeted improvements to the feature fusion strategy and the identification of robust attributes that are suitable for model compression or deployment in resource-constrained underwater environments. By visualizing activation maps across different channels, we found that RCF-YOLOv8 is more responsive to object edges and acoustic shadows, while the baseline YOLOv8 model is more susceptible to background interference. Furthermore, the EMA module effectively mitigates overfitting at specific scales, improving the model’s robustness in complex backgrounds.
Three types of input activate the proposed modules most strongly. For targets with clear spatial position information, CoordConv explicitly encodes position by adding coordinate channels and is sensitive to the boundary and center position of the target. For multi-scale salient targets, EMA's cross-spatial attention mechanism focuses on high-response areas across the channel and spatial dimensions, so sparse targets or cross-scale features are more likely to activate the output. For targets with coupled global and local features, the iAFF mechanism fuses global context with local details through iterative attention, and such inputs maximize the activation weights.
A performance comparison of YOLOv8 and RCF-YOLOv8 on the uniform validation dataset is shown in Figure 12. In Figure 12a,b, the YOLOv8 network exhibits clear omissions and false detections. In Figure 12c, the confidence of the RCF-YOLOv8 prediction result is higher. A higher confidence indicates greater certainty in the prediction result. This indicates challenges in the YOLOv8 model’s predictions in complex sonar image backgrounds. These issues affect the overall system performance and may lead to inaccurate target recognition in practical applications, impacting subsequent decisions and operations. RCF-YOLOv8 effectively reduces false positives and false negatives in YOLOv8-based forward-looking sonar detection while significantly improving target classification accuracy.

3.4.3. Comparative Experiment

To substantiate the enhanced detection capabilities of our novel algorithm, we employed a standardized dataset for both model training and performance assessment. We then compared the results of the following models: Faster R-CNN [17], SSD [18], YOLOv3-tiny [22], YOLOv4-tiny [23], YOLOv5s [28], YOLOv7s [25], YOLOv8s [26], and RCF-YOLOv8. A comparative analysis of detection precision between mainstream architectures and our novel approach is quantitatively shown in Table 2.
RCF-YOLOv8 proposed in this study has a lower model complexity than the Faster R-CNN [17] and SSD [18] algorithms, and it shows greater advantages in detection accuracy. Compared with YOLOv3-tiny [22], YOLOv4-tiny [23], YOLOv5s [28], YOLOv7s [25], and the original YOLOv8s [26], RCF-YOLOv8 has a higher model complexity but shows obvious advantages in detection accuracy. Faster R-CNN [17] employs the region proposal network (RPN) mechanism to generate potential object detection regions. Owing to the intricate characteristics of sonar imagery, the RPN may produce a substantial quantity of imprecise candidate bounding boxes, which can lead to reduced accuracy in subsequent classification and regression stages. SSD [18] relies on predefined anchor boxes, but the shapes and scales of objects in sonar images are diverse, and the selection of anchor boxes may not be flexible enough, affecting the detection results. YOLOv3 [22] incorporates a feature pyramid network (FPN) structure to facilitate detection across diverse object dimensions. Its mAP50 is 31.5% and 4.2% higher than that of Faster R-CNN [17] and SSD [18], respectively, and its mAP50-95 is 27.4% and 5.2% higher. YOLOv4 [23] introduces several architectural enhancements that significantly boost its detection performance. The optimized CSPDarknet53 backbone network demonstrates superior feature extraction capabilities, while the CIoU loss function provides more precise object localization. The implementation of the FPN+PAN architecture facilitates the effective integration of both low- and high-level feature representations, thereby improving detection accuracy. Furthermore, the adaptive anchor box mechanism enables robust detection across various object scales. In this experiment, YOLOv4 [23] does not achieve the expected effect. There are two main reasons for this: the YOLOv4-tiny [23] model has low parameters and an insufficient depth, resulting in poor feature extraction. Compared with YOLOv3 [22], its mAP50 decreases by 1.1%, and its mAP50-95 decreases by 3.4%. YOLOv5 [28] has a lighter architecture and uses the Focus module to extract more detailed features. It also optimizes the data augmentation technology and boosts the model’s generalization performance across diverse scenarios. Compared with YOLOv4 [23], YOLOv5’s [28] mAP50 increases by 1.5%, while its mAP50-95 decreases by 1.8%, indicating that YOLOv5 may have difficulty detecting targets with a high IoU in forward-looking sonar images, resulting in a decrease in mAP50-95. In YOLOv7 [25], the E-ELAN structure improves the network’s learning proficiency and makes the feature expression of the target richer. However, the anchor-based strategy of YOLOv7 [25] may not adapt well to targets in forward-looking sonar images, resulting in a decrease in detection performance. Compared with YOLOv5 [28], its mAP50 decreases by 5.7%, and its mAP50-95 decreases by 7.4%. YOLOv8 [26] adopts an anchor-free strategy, eliminating the limitations of anchor box selection and matching, making the model more flexible and accurate in target detection. By replacing the C3 component with the C2f structure in YOLOv8’s backbone, the discriminative representation of target features is significantly improved. In addition, the Transformer module provides a powerful global modeling capability to capture long-distance dependencies in images, which is crucial for target recognition in complex scenes. Compared with YOLOv7 [25], YOLOv8’s [26] mAP50 increases by 6.7%, and its mAP50-95 increases by 18.4%. 
Building on YOLOv8 [26], RCF-YOLOv8 replaces the standard Conv layers with CoordConv to capture spatial information, introduces the EMA module to capture the scarce features of sparsely distributed targets, and adopts the C2f-Fusion module to optimize feature fusion and improve the quality of the initial fused features. Compared with YOLOv8 [26], the mAP50 of RCF-YOLOv8 improves by 2.4%, and its mAP50-95 improves by 10.4%. These results demonstrate that the proposed RCF-YOLOv8 framework offers superior detection capability and effectively addresses the challenges of underwater target recognition.
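As a concrete illustration of the CoordConv replacement described above, the following PyTorch sketch concatenates two extra channels containing normalized x and y coordinates to the input feature map before a standard convolution, allowing the layer to learn position-dependent responses. The channel sizes are placeholders, and the snippet is an illustrative re-implementation of the CoordConv idea [48], not the authors' code.
```python
# Minimal CoordConv sketch: append coordinate channels, then convolve as usual.
import torch
import torch.nn as nn

class CoordConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # +2 input channels for the x and y coordinate maps
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size, stride, padding)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1.0, 1.0, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

feat = torch.randn(2, 64, 80, 80)      # dummy feature map
out = CoordConv(64, 128)(feat)         # -> shape (2, 128, 80, 80)
```
Normalizing the coordinate channels to [-1, 1] keeps them on a scale comparable to typical activations, which is one common convention for CoordConv layers.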
We tested Faster R-CNN [17], SSD [18], YOLOv3-tiny [22], YOLOv4-tiny [23], YOLOv5s [28], YOLOv7s [25], YOLOv8s [26], and RCF-YOLOv8 on the same test set; the detection results are shown in Figure 13. As can be seen in this figure, in the forward-looking sonar image target classification task, only RCF-YOLOv8 correctly classified the targets with relatively high confidence. In contrast, the results of Faster R-CNN [17] contained serious false detections and missed detections, and those of SSD [18] additionally exhibited overlapping detection boxes. YOLOv3-tiny [22] produced fewer missed detections, whereas YOLOv4-tiny [23] showed significant inaccuracies, including frequent false alarms and target omissions. YOLOv5s [28] classified only a small subset of target classes correctly, indicating poor discriminative capability, and the detection errors of YOLOv7s [25] were particularly pronounced, with substantial false positives and missed targets. YOLOv8s [26] performed better, but some false detections were still observed. Consequently, for target recognition in such complex environments, it is essential to design optimized network structures and tailored learning methodologies to enhance recognition precision. The comparative performance metrics of these experiments are illustrated in Figure 14.
A comprehensive evaluation of the models' detection capabilities is illustrated in Figure 14: Figure 14a presents the mAP50 metric (the mean average precision at an IoU threshold of 0.5), while Figure 14b presents the mAP50-95 metric, which averages precision over IoU thresholds from 0.5 to 0.95 in increments of 0.05. This holistic measure provides a reliable framework for assessing model effectiveness in diverse detection contexts. The experimental results show that the mAP values improve steadily with increasing training iterations and eventually converge. The comparison clearly demonstrates that RCF-YOLOv8 surpasses the other models on both the mAP50 and mAP50-95 indicators, highlighting its capability for target recognition in FLS imagery in challenging underwater environments.
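The following simplified, single-class sketch shows how the two metrics in Figure 14 relate: average precision is computed at each IoU threshold from 0.50 to 0.95 in steps of 0.05, mAP50 is the value at the first threshold, and mAP50-95 is the mean over all ten. The greedy matching and 101-point interpolation are stand-ins for a full COCO-style evaluation, the boxes are synthetic, and this is not the evaluation code used in the experiments.
```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(preds, gts, iou_thr):
    """preds: list of ([x1, y1, x2, y2], confidence); gts: list of boxes."""
    preds = sorted(preds, key=lambda p: p[1], reverse=True)
    matched, tp, fp = set(), np.zeros(len(preds)), np.zeros(len(preds))
    for i, (box, _) in enumerate(preds):
        ious = [box_iou(box, g) for g in gts]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= iou_thr and j not in matched:
            tp[i] = 1.0
            matched.add(j)
        else:
            fp[i] = 1.0
    recall = np.cumsum(tp) / max(len(gts), 1)
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp) + 1e-9)
    # 101-point interpolated average precision
    return float(np.mean([
        precision[recall >= r].max() if (recall >= r).any() else 0.0
        for r in np.linspace(0.0, 1.0, 101)
    ]))

ground_truths = [[10, 10, 50, 50], [60, 60, 90, 90]]
predictions = [([12, 11, 49, 52], 0.92), ([58, 61, 91, 89], 0.85), ([100, 100, 120, 120], 0.30)]
thresholds = np.arange(0.50, 1.00, 0.05)   # 0.50, 0.55, ..., 0.95
ap = [average_precision(predictions, ground_truths, t) for t in thresholds]
map50, map50_95 = ap[0], float(np.mean(ap))
print(f"mAP50 = {map50:.3f}, mAP50-95 = {map50_95:.3f}")
```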

3.5. Verification of the Generalization Ability of Classification Models

In order to verify the generalization of RCF-YOLOv8, six datasets with different noise types and levels were constructed from the URPC2021 sonar dataset (contrast-enhanced by histogram equalization) to simulate complex ocean environments [68]: two with speckle noise (variances of 0.2 and 1.0), two with Gaussian noise (variances of 0.01 and 0.12), and two with impulse noise (variances of 0.02 and 0.10), representing different sea areas, as shown in Figure 15. The model was trained and tested on each dataset, and the prediction results are shown in Figure 16.
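As an illustration of how such noisy variants can be generated, the following NumPy sketch applies multiplicative speckle, additive Gaussian, and salt-and-pepper (impulse) noise at the levels listed above. Interpreting the impulse-noise "variance" as the fraction of corrupted pixels is an assumption made here for illustration, and the snippet is not the authors' preprocessing code.
```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian(img, var):
    """Additive zero-mean Gaussian noise; img is float32 in [0, 1]."""
    return np.clip(img + rng.normal(0.0, np.sqrt(var), img.shape), 0.0, 1.0)

def add_speckle(img, var):
    """Multiplicative (speckle) noise: img + img * n, with n ~ N(0, var)."""
    return np.clip(img + img * rng.normal(0.0, np.sqrt(var), img.shape), 0.0, 1.0)

def add_impulse(img, amount):
    """Salt-and-pepper noise; 'amount' is the fraction of corrupted pixels."""
    out = img.copy()
    mask = rng.random(img.shape) < amount
    out[mask] = rng.integers(0, 2, mask.sum()).astype(img.dtype)
    return out

img = rng.random((256, 256)).astype(np.float32)   # stand-in for a sonar frame
noisy_sets = {
    "speckle_0.2": add_speckle(img, 0.2),   "speckle_1.0": add_speckle(img, 1.0),
    "gauss_0.01":  add_gaussian(img, 0.01), "gauss_0.12":  add_gaussian(img, 0.12),
    "impulse_0.02": add_impulse(img, 0.02), "impulse_0.10": add_impulse(img, 0.10),
}
```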
In Figure 16, panels (a–f) show the validation results on the different test sets for models trained on the datasets constructed with speckle noise (variances of 0.2 and 1.0), Gaussian noise (variances of 0.01 and 0.12), and impulse noise (variances of 0.02 and 0.10), respectively. By introducing speckle, Gaussian, and impulse noise with reasonably chosen parameters, the generalization ability of RCF-YOLOv8 under different noise conditions can be evaluated effectively. The quantitative results in Figure 16 show that RCF-YOLOv8 maintains high prediction accuracy and stability across the test scenarios, with low and controllable prediction deviation, demonstrating its strong generalization performance.

4. Discussion

4.1. Acoustic Image Object Detection Network

Underwater acoustic image object detection is an important research direction at the intersection of ocean engineering and computer vision. Although deep learning-based object detection methods have achieved high accuracy and real-time performance on natural scene images, the unique characteristics and complexity of the underwater environment, including a low signal-to-noise ratio, blurred target outlines, severe scattering interference in the water column, and confusion between background clutter and target features, significantly restrict the robustness and generalization of detection algorithms. Two key approaches to improving detection performance are enhancing feature extraction and optimizing feature fusion. Based on the YOLOv8 framework, this paper introduces the EMA attention mechanism and the C2f-Fusion feature fusion module and employs the CoordConv operation to construct a specialized object detection network for forward-looking sonar images. The ablation and comparative experiments demonstrate that the proposed method significantly improves detection accuracy while effectively reducing false detection and missed detection rates.

4.2. Challenges in Underwater Object Detection

Currently, the limited number of acoustic image samples and the lack of high-quality annotated data remain key obstacles to underwater object detection. Insufficient samples significantly limit the generalization ability of deep learning models in complex underwater scenarios. Although the proposed algorithm performs well in terms of accuracy, its overall performance is still constrained by the limited diversity of target categories and sonar images in the current dataset. Future work will integrate more forward-looking sonar image samples to expand the dataset and further strengthen the reliability and generalization of the validation. Balancing detection accuracy and inference speed also remains a key challenge: the modules added in this study slightly increase model complexity and therefore reduce detection speed. Future research will focus on lightweight techniques that improve inference efficiency while maintaining detection accuracy, thereby meeting the dual requirements of high accuracy and real-time performance in practical applications.

4.3. Generalization and Interpretability Analysis

To verify the generalization of the proposed method, we introduced noise of varying types and intensities to the original dataset to simulate acoustic imaging under various sea conditions, and the trained models demonstrated good robustness and generalization across the test sets. However, the applicability of the framework to other sonar imaging modes, such as sidescan sonar and synthetic aperture sonar, remains to be tested. Furthermore, the model's scalability to more complex, large-scale, real-world underwater scenarios requires systematic evaluation, particularly the potential for increased false alarm rates under significant target scale variation or low contrast. Future research will enhance multi-scale detection mechanisms to maintain high accuracy for targets of varying sizes in diverse environmental conditions and explore online adaptive techniques for real-time model tuning in unknown environments. In addition, while ensuring detection accuracy, improving the credibility and interpretability of the training process is a key challenge for deep learning; integrating automated Explainable AI (XAI) [69,70] tools is expected to provide a clearer basis for model decision-making, and related research warrants further exploration.

5. Conclusions

In this study, we present an advanced detection framework for FLS seabed targets, called RCF-YOLOv8, which is built on the YOLOv8 architecture. The proposed approach incorporates three key innovations. First, by exploiting the spatial information inherent in FLS imagery, traditional convolution layers are replaced with CoordConv to reduce false-positive and miss rates in seabed target detection. Second, the backbone network is augmented with an EMA module to strengthen feature extraction; this modification improves the model's learning capacity and target representation, enabling it to prioritize sonar image targets while enhancing detection accuracy and efficiency. Third, a novel connection mechanism, the C2f-Fusion module, is adapted for FLS target detection to leverage hierarchical feature representations while minimizing feature loss during extraction, optimizing feature fusion quality and improving the initial integration performance. Experimental validation on the URPC2021 dataset demonstrates significant improvements, achieving an mAP50 of 98.8% and an mAP50-95 of 67.6%. Finally, different types of noise were added to the dataset to simulate different marine environments and verify the generalization ability of the proposed model.
In future work, we will deepen our research in three core areas, namely dataset expansion, model lightweighting, and model optimization, to promote the practical application of underwater acoustic image target detection. Regarding dataset construction, we will pursue multi-device data acquisition and synthesize highly realistic simulated samples using generative adversarial networks (GANs) and diffusion models. Regarding model lightweighting, we will focus on efficient acoustic feature extraction and streamlined network design: enhancing the model's ability to focus on the key features of blurred targets; reducing parameter size and computational complexity while preserving detection accuracy; and applying model compression techniques (pruning and quantization) together with quantization-aware training to obtain a lightweight architecture for embedded platforms, enabling real-time inference and hardware acceleration via TensorRT or a dedicated NPU. Regarding model optimization, XAI will be used to provide explainability for the training process and to establish a reliable real-time sonar image target detection system. Finally, we will evaluate the proposed modules with the latest YOLO versions to further verify their effectiveness.
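As a rough sketch of the compression steps mentioned above, the snippet below applies unstructured L1 pruning to the convolutional layers of a small placeholder network and then dynamic int8 quantization to its linear layers using standard PyTorch utilities; the network, the 30% pruning ratio, and the quantization settings are illustrative assumptions rather than the planned RCF-YOLOv8 deployment pipeline.
```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(               # placeholder network, not RCF-YOLOv8
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 64 * 64, 4),
)

# 1) Unstructured L1 pruning: zero out 30% of each conv layer's weights.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")          # make the pruning permanent

# 2) Dynamic quantization of the linear layers to int8 for CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

dummy = torch.randn(1, 3, 64, 64)
print(quantized(dummy).shape)                   # torch.Size([1, 4])
```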

Author Contributions

Conceptualization, methodology, and software, X.L. (Xiaoxue Li); validation, X.L. (Xiaoxue Li), X.L. (Xueqin Liu), Y.C., Z.Q., J.W. and Q.Y.; writing—original draft preparation, X.L. (Xiaoxue Li); writing—review and editing, X.L. (Xueqin Liu), Y.C., Z.Q., J.W. and Q.Y.; visualization, X.L. (Xiaoxue Li); supervision, X.L. (Xueqin Liu); project administration, X.L. (Xueqin Liu); funding acquisition, X.L. (Xueqin Liu) and Z.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (42106070) and the Shandong Provincial Natural Science Foundation (ZR2024ME233).

Data Availability Statement

The datasets presented in this paper are available at https://openi.pcl.ac.cn/OpenOrcinus_orca/URPC2021_sonar_images_dataset, accessed on 26 November 2023.

Acknowledgments

The authors are sincerely grateful to Shengqi Yu, Benjun Ma, Yongzheng Liu, and Wenjian Lan at Harbin Engineering University for their expert guidance in sonar image processing methodologies and generous assistance in refining the English expression of this manuscript during the revision process. Their insightful suggestions significantly enhanced the analytical rigor and linguistic quality of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, G.; Cong, H.; Zhao, M.; Gong, Z. Underwater sonar image detection algorithm based on corner. In Proceedings of the 2nd International Conference on Artificial Intelligence, Automation, and High-Performance Computing (AIAHPC 2022), Virtual, 25–27 February 2022; Volume 12348, pp. 336–343. [Google Scholar]
  2. Li, Y.; Ye, X.; Zhang, W.; Liu, W. DCSP-Yolov5: Improved Yolov5 Based on Dilated Convolution for Object Detection of Forward-Looking Sonar Images. In Proceedings of the OCEANS 2022, Hampton Roads, VA, USA, 17–20 October 2022; pp. 1–5. [Google Scholar]
  3. Li, Z.; Xie, Z.; Duan, P.; Kang, X.; Li, S. Dual Spatial Attention Network for Underwater Object Detection with Sonar Imagery. IEEE Sens. J. 2024, 24, 6998–7008. [Google Scholar] [CrossRef]
  4. Li, D.; Qu, D.; Li, X.; Li, L.; Gao, Q.; Yu, X. Lightweight global adaptive feature enhancement network for underwater object detection with sonar image. J. Phys. Conf. Ser. 2024, 2914, 012023. [Google Scholar] [CrossRef]
  5. Yuanzi, L.; Xiufen, Y.; Weizheng, Z. Transyolo: High-performance object detector for forward looking sonar images. IEEE Signal Process. Lett. 2022, 29, 2098–2102. [Google Scholar]
  6. Zheng, L.; Hu, T.; Zhu, J. Underwater sonar target detection based on improved ScEMA YOLOv8. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 1503505. [Google Scholar] [CrossRef]
  7. Fortes, I.S.; Araujo, J.C.; Pereira, B.S.B.; Seoane, J.C.S. Sea bottom types of a coral reef marine protected area revealed by side scan survey. In Proceedings of the 2015 IEEE/OES Acoustics in Underwater Geosciences Symposium (RIO Acoustics), Rio de Janeiro, Brazil, 29–31 July 2015; pp. 1–9. [Google Scholar]
  8. Williams, D.P. Fast target detection in synthetic aperture sonar imagery: A new algorithm and large-scale performance analysis. IEEE J. Ocean. Eng. 2014, 40, 71–92. [Google Scholar] [CrossRef]
  9. Song, S.B.; Liu, J.F.; Ni, H.Y.; Cao, X.L.; Pu, H.; Huang, B.X. A new automatic thresholding algorithm for unimodal gray-level distribution images by using the gray gradient information. J. Pet. Sci. Eng. 2020, 190, 107074. [Google Scholar] [CrossRef]
  10. Bianco, M.J.; Gerstoft, P.; Traer, J.; Ozanich, E.; Roch, M.A.; Gannot, S.; Deledalle, C.A. Machine learning in acoustics: Theory and applications. J. Acoust. Soc. Am. 2019, 146, 3590–3628. [Google Scholar] [CrossRef]
  11. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  12. Szegedy, C.; Toshev, A.; Erhan, D. Deep Neural Networks for Object Detection. In Advances in Neural Information Processing Systems; Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K., Eds.; Curran Associates, Inc.: Nice, France, 2013; Volume 26. [Google Scholar]
  13. Karimanzira, D.; Renkewitz, H.; Shea, D.; Albiez, J. Object detection in sonar images. Electronics 2020, 9, 1180. [Google Scholar] [CrossRef]
  14. Topini, E.; Fanelli, F.; Topini, A.; Pebody, M.; Ridolfi, A.; Phillips, A.B.; Allotta, B. An experimental comparison of Deep Learning strategies for AUV navigation in DVL-denied environments. Ocean. Eng. 2023, 274, 114034. [Google Scholar] [CrossRef]
  15. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  16. Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
  17. Ren, S. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed]
  18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  19. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
  20. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  21. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  22. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6. [Google Scholar]
  23. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  24. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  25. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  26. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef]
  27. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  28. Zhang, Y.; Guo, Z.; Wu, J.; Tian, Y.; Tang, H.; Guo, X. Real-time vehicle detection based on improved yolo v5. Sustainability 2022, 14, 12274. [Google Scholar] [CrossRef]
  29. Kim, J.H.; Kim, N.; Park, Y.W.; Won, C.S. Object detection and classification based on YOLO-V5 with improved maritime dataset. J. Mar. Sci. Eng. 2022, 10, 377. [Google Scholar] [CrossRef]
  30. Wang, X.; He, N.; Hong, C.; Wang, Q.; Chen, M. Improved YOLOX-X based UAV aerial photography object detection algorithm. Image Vis. Comput. 2023, 135, 104697. [Google Scholar] [CrossRef]
  31. Li, S.; Fu, X.; Dong, J. Improved ship detection algorithm based on YOLOX for SAR outline enhancement image. Remote. Sens. 2022, 14, 4070. [Google Scholar] [CrossRef]
  32. Wang, Y.; Liu, J.; Yu, S.; Wang, K.; Han, Z.; Tang, Y. Underwater Object Detection based on YOLO-v3 network. In Proceedings of the 2021 IEEE International Conference on Unmanned Systems (ICUS), Beijing, China, 15–17 October 2021; pp. 571–575. [Google Scholar]
  33. Li, S.; Zhang, W.; Luo, R.; Zeng, P.; Jiang, X.; Zhu, L.; Wang, Z. A Research of Deep Learning on Target Detection of Underwater Sonar Images. In Proceedings of the 2022 10th International Conference on Information Systems and Computing Technology (ISCTech), Guilin, China, 28–30 December 2022; pp. 759–765. [Google Scholar]
  34. Xie, B.; He, S.; Cao, X. Target detection for forward looking sonar image based on deep learning. In Proceedings of the 2022 41st Chinese Control Conference (CCC), Hefei, China, 25–27 July 2022; pp. 7191–7196. [Google Scholar]
  35. Steiniger, Y.; Groen, J.; Stoppe, J.; Kraus, D.; Meisen, T. A study on modern deep learning detection algorithms for automatic target recognition in sidescan sonar images. In Proceedings of the Meetings on Acoustics, Virtual, 8–10 June 2021; Volume 44. [Google Scholar]
  36. Fan, X.; Lu, L.; Shi, P.; Zhang, X. A novel sonar target detection and classification algorithm. Multimed. Tools Appl. 2022, 81, 10091–10106. [Google Scholar] [CrossRef]
  37. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  38. Chen, R.; Zhan, S.; Chen, Y. Underwater target detection algorithm based on YOLO and Swin transformer for sonar images. In Proceedings of the OCEANS 2022, Hampton Roads, VA, USA, 21–24 February 2022; pp. 1–7. [Google Scholar]
  39. Aubard, M.; Madureira, A.; Madureira, L.; Pinto, J. Real-time automatic wall detection and localization based on side scan sonar images. In Proceedings of the 2022 IEEE/OES Autonomous Underwater Vehicles Symposium (AUV), Singapore, 19–21 September 2022; pp. 1–6. [Google Scholar]
  40. Xing, B.; Sun, M.; Liu, Z.; Guan, L.; Han, J.; Yan, C.; Han, C. Sonar Fish School Detection and Counting Method Based on Improved YOLOv8 and BoT-SORT. J. Mar. Sci. Eng. 2024, 12, 964. [Google Scholar] [CrossRef]
  41. Yulin, T.; Jin, S.; Bian, G.; Zhang, Y. Shipwreck target recognition in side-scan sonar images by improved YOLOv3 model based on transfer learning. IEEE Access 2020, 8, 173450–173460. [Google Scholar] [CrossRef]
  42. Peng, C.; Jin, S.; Liu, H.; Zhang, W.; Xia, H. Adversarial enhancement generation method for side-scan sonar images based on DDPM–YOLO. Mar. Geod. 2024, 47, 526–554. [Google Scholar] [CrossRef]
  43. Qu, P.; Cheng, E.; Chen, K. Real-Time Ocean Small Target Detection Based on Improved YOLOX Network. In Proceedings of the OCEANS 2022, Hampton Roads, VA, USA, 21–24 February 2022; pp. 1–5. [Google Scholar]
  44. Zhang, F.; Zhang, W.; Cheng, C.; Hou, X.; Cao, C. Detection of small objects in side-scan sonar images using an enhanced YOLOv7-based approach. J. Mar. Sci. Eng. 2023, 11, 2155. [Google Scholar] [CrossRef]
  45. Zhuang, Y.; Liu, J.; Zhao, H.; Ma, L.; Fang, Z.; Li, L.; Wu, C.; Cui, W.; Liu, Z. A deep learning framework based on structured space model for detecting small objects in complex underwater environments. Commun. Eng. 2025, 4, 24. [Google Scholar] [CrossRef]
  46. Wu, T.; Dong, Y. YOLO-SE: Improved YOLOv8 for remote sensing object detection and recognition. Appl. Sci. 2023, 13, 12977. [Google Scholar] [CrossRef]
  47. Shen, L.; Lang, B.; Song, Z. DS-YOLOv8-Based object detection method for remote sensing images. IEEE Access 2023, 11, 125122–125137. [Google Scholar] [CrossRef]
  48. Liu, R.; Lehman, J.; Molino, P.; Petroski Such, F.; Frank, E.; Sergeev, A.; Yosinski, J. An intriguing failing of convolutional neural networks and the coordconv solution. Adv. Neural Inf. Process. Syst. 2018, 31, 9628–9639. [Google Scholar]
  49. Lee, B.; Ku, B.; Kim, W.; Kim, S.; Ko, H. Feature sparse coding with coordconv for side scan sonar image enhancement. IEEE Geosci. Remote. Sens. Lett. 2020, 19, 1–5. [Google Scholar] [CrossRef]
  50. Zhang, H.; Tian, M.; Shao, G.; Cheng, J.; Liu, J. Target detection of forward-looking sonar image based on improved YOLOv5. IEEE Access 2022, 10, 18023–18034. [Google Scholar] [CrossRef]
  51. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  52. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  53. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  54. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  55. Zhang, X.; Zhu, D.; Gan, W. YOLOv7t-CEBC Network for Underwater Litter Detection. J. Mar. Sci. Eng. 2024, 12, 524. [Google Scholar] [CrossRef]
  56. Quine, W.V. Concatenation as a basis for arithmetic. J. Symb. Log. 1946, 11, 105–114. [Google Scholar] [CrossRef]
  57. Levy, O.; Lee, K.; FitzGerald, N.; Zettlemoyer, L. Long short-term memory as a dynamically computed element-wise weighted sum. arXiv 2018, arXiv:1805.03716. [Google Scholar]
  58. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  59. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  60. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
  61. Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight underwater object detection based on yolo v4 and multi-scale attentional feature fusion. Remote. Sens. 2021, 13, 4706. [Google Scholar] [CrossRef]
  62. Zhao, W.; Kang, Y.; Chen, H.; Zhao, Z.; Zhao, Z.; Zhai, Y. Adaptively attentional feature fusion oriented to multiscale object detection in remote sensing images. IEEE Trans. Instrum. Meas. 2023, 72, 1–11. [Google Scholar] [CrossRef]
  63. Tian, Y.; Lan, L.; Guo, H. A review on the wavelet methods for sonar image segmentation. Int. J. Adv. Robot. Syst. 2020, 17, 1729881420936091. [Google Scholar] [CrossRef]
  64. Yang, H.; Yu, X.; Zhang, T.; Zhou, T. SSE-YOLO: A lighter and faster object detection network for small targets in sonar images. In Proceedings of the 2023 IEEE 11th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 21–22 October 2023; pp. 230–234. [Google Scholar]
  65. Danielyan, A.; Katkovnik, V.; Egiazarian, K. BM3D frames and variational image deblurring. IEEE Trans. Image Process. 2011, 21, 1715–1728. [Google Scholar] [CrossRef]
  66. Wen, X.; Wang, J.; Cheng, C.; Zhang, F.; Pan, G. Underwater side-scan sonar target detection: YOLOv7 model combined with attention mechanism and scaling factor. Remote. Sens. 2024, 16, 2492. [Google Scholar] [CrossRef]
  67. Most, T.; Will, J. Sensitivity analysis using the Metamodel of Optimal Prognosis. arXiv 2024, arXiv:2408.03590. [Google Scholar] [CrossRef]
  68. Ma, Q.; Jiang, L.; Yu, W.; Jin, R.; Wu, Z.; Xu, F. Training with noise adversarial network: A generalization method for object detection on sonar image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 729–738. [Google Scholar]
  69. Dwivedi, R.; Dave, D.; Naik, H.; Singhal, S.; Omer, R.; Patel, P.; Qian, B.; Wen, Z.; Shah, T.; Morgan, G.; et al. Explainable AI (XAI): Core ideas, techniques, and solutions. ACM Comput. Surv. 2023, 55, 1–33. [Google Scholar] [CrossRef]
  70. Gunning, D.; Aha, D. DARPA’s explainable artificial intelligence (XAI) program. AI Mag. 2019, 40, 44–58. [Google Scholar]
Figure 1. Network structure diagram of RCF-YOLOv8.
Figure 2. Structure diagram of CoordConv.
Figure 3. EMA module structure diagram.
Figure 4. Connection of EMA modules in the model.
Figure 5. iAFF module structure diagram.
Figure 6. Comparison of the C2f structure before and after the change: (a) the C2f structure and (b) the C2f-Fusion module.
Figure 7. Dataset categories and numbers.
Figure 8. Panels (a–h) show the F1-score curves of Experiments 1–8.
Figure 9. Panels (a–h) show the recall rate curves of Experiments 1–8.
Figure 10. Panels (a–h) show the confusion matrices of Experiments 1–8.
Figure 11. Some channel visualization feature maps: the feature maps of (a) the YOLOv8 model with Conv, (b) the RCF-YOLOv8 model with CoordConv, (c) the YOLOv8 baseline model, (d) the RCF-YOLOv8 model with the EMA module, (e) the YOLOv8 model with the C2f structure, and (f) the RCF-YOLOv8 model with the C2f-Fusion module.
Figure 12. Comparison of some classification results between YOLOv8 and RCF-YOLOv8. The top figure presents the target detection results of RCF-YOLOv8, and the bottom figure presents those of YOLOv8: (a) missed detections by YOLOv8, (b) false detections by YOLOv8, and (c) the improvement in target detection accuracy of RCF-YOLOv8 over YOLOv8.
Figure 13. Comparison of the classification results of each algorithm: (a–h) SSD, Faster R-CNN, YOLOv3-tiny, YOLOv4-tiny, YOLOv5s, YOLOv7s, YOLOv8s, and RCF-YOLOv8.
Figure 14. Comparison of the detection accuracy of different algorithms: (a) the change curve of mAP50 and (b) the change curve of mAP50-95.
Figure 15. Sonar images with different types and intensities of added noise: (a,b) speckle noise with variances of 0.2 and 1.0; (c,d) Gaussian noise with variances of 0.01 and 0.12; (e,f) impulse noise with variances of 0.02 and 0.10.
Figure 16. Validation results of the models trained and tested on each noisy dataset: panels (a–f) show the validation of the different test sets using training sets constructed with speckle noise (variances of 0.2 and 1.0), Gaussian noise (variances of 0.01 and 0.12), and impulse noise (variances of 0.02 and 0.10), respectively.
Table 1. Ablation of model configurations for underwater object detection.

| Experiment | YOLOv8 | CoordConv | EMA | C2f-Fusion Module | mAP50 (%) | mAP50-95 (%) | Params (M) | FPS |
|------------|--------|-----------|-----|-------------------|-----------|--------------|------------|-----|
| 1 | ✓ | - | - | - | 96.4 | 57.2 | 11.14 | 909 |
| 2 | ✓ | ✓ | - | - | 98.8 | 67.0 | 11.16 | 227 |
| 3 | ✓ | - | ✓ | - | 98.6 | 67.1 | 11.55 | 833 |
| 4 | ✓ | - | - | ✓ | 98.5 | 65.7 | 11.56 | 833 |
| 5 | ✓ | - | ✓ | ✓ | 98.8 | 66.9 | 11.97 | 769 |
| 6 | ✓ | ✓ | - | ✓ | 98.7 | 67.4 | 11.58 | 222 |
| 7 | ✓ | ✓ | ✓ | - | 98.7 | 67.1 | 11.57 | 217 |
| 8 | ✓ | ✓ | ✓ | ✓ | 98.8 | 67.6 | 11.99 | 208 |

Note: ✓ indicates the inclusion of the component and "-" denotes absence.
Table 2. Comparison of model configurations for underwater object detection.

| Model | Input Size | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs |
|-------|------------|-----------|--------------|------------|--------|
| Faster R-CNN | 640 × 640 | 64.5 | 24.0 | 137.1 | 370.2 |
| SSD | 640 × 640 | 91.9 | 46.2 | 26.29 | 62.7 |
| YOLOv3-tiny | 640 × 640 | 96.0 | 51.4 | 8.69 | 13.0 |
| YOLOv4-tiny | 640 × 640 | 94.9 | 48.0 | 5.89 | 16.2 |
| YOLOv5s | 640 × 640 | 95.4 | 46.2 | 7.04 | 16.0 |
| YOLOv7s | 640 × 640 | 89.7 | 38.8 | 6.03 | 13.2 |
| YOLOv8s | 640 × 640 | 96.4 | 57.2 | 11.14 | 28.7 |
| RCF-YOLOv8 | 640 × 640 | 98.8 | 67.6 | 11.99 | 29.9 |