Article

Application and Analysis of the MFF-YOLOv7 Model in Underwater Sonar Image Target Detection

1 Graduate School, Guilin University of Electronic Technology, Guilin 541004, China
2 School of Life and Environmental Sciences, Guilin University of Electronic Technology, Guilin 541004, China
3 School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, China
4 Cognitive Radio and Information Processing Key Laboratory Authorized by China’s Ministry of Education Foundation, Guilin University of Electronic Technology, Guilin 541004, China
5 School of Ocean Engineering, Guilin University of Electronic Technology, Beihai 536065, China
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(12), 2326; https://doi.org/10.3390/jmse12122326
Submission received: 18 November 2024 / Revised: 5 December 2024 / Accepted: 13 December 2024 / Published: 18 December 2024
(This article belongs to the Special Issue Application of Deep Learning in Underwater Image Processing)

Abstract

The need for precise identification of underwater sonar image targets is growing in areas such as marine resource exploitation, subsea construction, and ocean ecosystem surveillance. Nevertheless, conventional image recognition algorithms encounter several obstacles, including intricate underwater settings, poor-quality sonar image data, and limited sample quantities, which hinder accurate identification. This study seeks to improve underwater sonar image target recognition capabilities by employing deep learning techniques and developing the Multi-Gradient Feature Fusion YOLOv7 model (MFF-YOLOv7) to address these challenges. This model incorporates the Multi-Scale Information Fusion Module (MIFM) as a replacement for YOLOv7’s SPPCSPC, substitutes the Conv of CBS following ELAN with RFAConv, and integrates the SCSA mechanism at three junctions where the backbone links to the head, enhancing target recognition accuracy. Trials were conducted using datasets like URPC, SCTD, and UATD, encompassing comparative studies of attention mechanisms, ablation tests, and evaluations against other leading algorithms. The findings indicate that the MFF-YOLOv7 model substantially surpasses other models across various metrics, demonstrates superior underwater target detection capabilities, exhibits enhanced generalization potential, and offers a more dependable and precise solution for underwater target identification.

1. Introduction

The need for precise recognition in underwater target detection is growing rapidly, encompassing fields such as marine resource exploitation, subsea construction, and oceanic ecosystem monitoring. Sonar image target recognition plays a vital role in these applications [1,2,3,4,5]. Nevertheless, the intricate underwater environment, low-quality sonar image data, and limited sample sizes pose significant challenges to conventional image recognition algorithms, hindering accurate identification. Consequently, there is a pressing need to develop more effective methods to enhance sonar image target recognition performance.
This research aims to address these issues by leveraging deep learning techniques. The main goal is to create a robust underwater sonar image target recognition system that can handle environmental complexities, improve target identification precision, and resolve the problem of limited data samples. The innovative aspect of this research lies in the creation of the Multi-Gradient Feature Fusion YOLOv7 model (MFF-YOLOv7) using deep learning techniques. This model introduces several key improvements: a novel Multi-Scale Information Fusion Module replaces YOLOv7’s SPPCSPC, enabling better feature capture of varying target sizes in sonar images; RFAConv substitutes the Conv in the two CBSs following ELAN; and the SCSA mechanism is incorporated at three junctions between the backbone and head to enhance the model’s ability to handle underwater environmental complexities, focus on relevant target recognition features, and improve recognition accuracy. These enhancements are expected to significantly improve YOLOv7’s performance in sonar image recognition, contributing substantially to underwater target detection and offering more reliable and efficient solutions for various applications.
Sonar images originate from imaging sonar. When operating as an active sonar system, the process is as follows:
  • The sonar system emits sound waves.
  • The sound waves pass through the water, reflect off underwater targets, and return.
  • The reflected echoes return to the sonar system.
  • Images are formed through the complex processing of these echoes.
The underwater environment’s complexity and inherent unpredictability make the imaging process susceptible to medium-related influences. Echo signals often encounter issues such as attenuation and distortion, resulting in sonar images with reduced contrast and resolution, indistinct target boundaries, and barely discernible features [6,7,8,9].
Conventional methods for recognizing targets in sonar images primarily rely on features based on pixels, grayscale values, or preconceived notions about the targets [10,11], often resulting in limited accuracy. In recent times, the field of computer vision has been revolutionized by deep learning, which has subsequently advanced underwater target detection. Deep learning-based target detection approaches are typically categorized into two-stage and one-stage algorithms. Two-stage algorithms first identify potential regions containing targets, followed by classification and localization within these areas. For instance, Villon et al. [12] employed convolutional neural networks to swiftly identify fish in marine images, achieving a 94.9% accuracy rate. Guo et al. [13] utilized deep residual networks to recognize sea cucumbers with an 89.5% accuracy rate. Dai et al. [14] created a dual-branch backbone network called GCC-Net, which uses both enhanced and original images as input to train underwater target detectors. However, these methods are computationally intensive, slow, and require numerous candidate regions. Moreover, when dealing with complex sonar images, there is still potential for improving accuracy.
Unlike two-stage methods, single-stage algorithms like the YOLO [15,16,17,18,19,20,21] family eliminate the need for a candidate region network, instead generating prediction boxes directly on the input image for target detection. Muksit [22] and colleagues introduced the YOLO-Fish algorithm in 2022, achieving 76.56% accuracy in identifying 20 distinct fish species in their habitats. In 2024, Liu et al. [23] developed a modular underwater enhancement component that could be integrated into YOLOv5, resulting in a 2.6% increase in mAP on the DUO dataset. Lei et al. [24] enhanced YOLOv5’s backbone network by incorporating the Swin Transformer. Although it achieved a small increase in mAP, it significantly increased the model volume. Although the YOLO algorithm has a speed advantage, it occasionally has missed or false detections in complex environments with high noise points and dense small targets, raising concerns about its stability.
Traditional image recognition algorithms struggle to process low-resolution sonar image data accurately, and the processing pipeline is complex. Even sophisticated deep learning models have room for improvement when dealing with intricate sonar images. For example, the YOLO framework encounters several challenges, including poor image quality in complex underwater settings, difficulty detecting small and closely grouped targets, and both missed detections and false positives. Although YOLOv9 and YOLOv10 excel in optical image processing, YOLOv7 proves more advantageous for sonar images when high accuracy must be achieved with minimal resources. This is because underwater sonar targets are typically small-scale and accompanied by increased noise, and YOLOv7 demonstrates superior recognition capabilities for small targets. Consequently, this study aims to enhance YOLOv7 to address these challenges. It introduces MFF-YOLOv7, which improves noise processing in low-resolution scenarios and enhances detection of dense, small targets.
In the field of underwater target image recognition, the traditional YOLOv7 model has significant limitations when dealing with high-noise and unclear sonar images, owing to the complex underwater environment, varied target sizes, a large amount of interfering information in sonar images, low imaging resolution, and small, dense targets. These factors can cause missed and false detections. To solve these problems, we propose the MFF-YOLOv7 model. First, we introduce the original Multi-Scale Information Fusion Module (MIFM) to replace SPPCSPC. The MIFM can better fuse multi-scale information and enhance the model's ability to process features at different scales. Even in complex underwater scenes, it can accurately identify targets of various sizes, effectively solving the problem of processing features at different scales caused by the large size differences among underwater targets. Second, we replace the Conv in the CBS following ELAN with RFAConv. Given the high noise and unclearness of sonar images, existing feature extraction methods have deficiencies, whereas RFAConv has better feature extraction capabilities and adapts better to this type of sonar image data. It can significantly improve the model's learning and representation of sonar image features, enabling it to better extract useful target features from the noise.
The SCSA mechanism should be implemented at the three junctions where the backbone connects to the head. In underwater target recognition, sonar images often contain numerous interfering elements, which can lead to the model being influenced by irrelevant information. By employing the SCSA mechanism, the model can prioritize crucial feature information and minimize the impact of unrelated data. This allows the model to concentrate more effectively on target recognition-related features when transferring information from the backbone to the head, thus enhancing the model’s recognition accuracy. To assess the efficacy of the proposed approach, we conducted rigorous comparative evaluations using various real-world sonar image datasets, including URPC [25], SCTD [26], and UATD [27].
To conclude, the core innovations of this paper are embodied in the following key contributions:
  • The traditional YOLOv7 model has many limitations when dealing with high-noise and unclear sonar images, owing to the complex underwater environment, varied target sizes, a large amount of interfering information, low imaging resolution, and small, dense targets. These factors can easily lead to missed detections and false detections. To address these issues, we designed the MFF-YOLOv7 model.
  • The Multi-Scale Information Fusion Module (MIFM) has been introduced and implemented to enhance the YOLOv7 model. This module excels at integrating information from various scales, thereby improving the model’s ability to process features at different levels, particularly in complex underwater environments where target dimensions fluctuate. The MIFM demonstrates robust fusion capabilities, overcoming the constraints of conventional modules and effectively capturing characteristics of targets with diverse sizes. Additionally, the MIFM can dynamically adjust its focus on targets of varying scales based on the actual underwater conditions, enabling intelligent resource allocation. This mechanism substantially enhances the precision of sonar image target identification while minimizing instances of missed and false detections.
  • Rigorous comparative evaluations were conducted on the real-world sonar image datasets URPC, SCTD, and UATD. The results indicate that the MFF-YOLOv7 model performs exceptionally well across these datasets. It demonstrates good performance on each specific dataset, exhibits strong generalization ability, and can adapt to sonar image recognition tasks in different scenarios.
The article’s subsequent sections are organized as follows: Section 2 focuses on introducing YOLOv7 and several enhanced modules, including MIFM, RFAConv, and the SCSA mechanism, along with the architecture and enhancements of the MFF-YOLOv7 network. These module improvements aim to enhance the model’s effectiveness in recognizing targets in underwater sonar images. Section 3 offers a comparative evaluation of MFF-YOLOv7 against leading sonar image recognition technologies and presents findings from multiple real-world datasets. Section 4 concludes the article by summarizing the research outcomes and providing closing remarks.

2. Background

In underwater target image recognition, the continuous development of related technologies provides strong support for achieving more accurate and efficient target detection. This section introduces the related work. First, the baseline target detection algorithm YOLOv7 is described and its directions for improvement are derived. The following subsections then introduce YOLOv7, the Multi-Scale Information Fusion Module (MIFM), RFAConv, the SCSA mechanism, and the resulting MFF-YOLOv7. Through an in-depth analysis of these techniques, the innovations and breakthroughs of this study in underwater target image recognition are demonstrated.

2.1. YOLOv7

YOLOv7, as an advanced target detection algorithm, holds an important position in computer vision, especially excelling in target detection tasks. As shown in Figure 1, the structure of YOLOv7 mainly consists of key components such as Backbone, Head, and Prediction. The input image with a size of 640 × 640 × 3 first enters the Backbone part. Here, through a series of elaborately designed network layers, such as the efficient ELAN module, as well as operations like convolution (Conv), batch normalization (BN), activation function (SiLU), etc., a multi-level feature extraction is performed on the image. During the feature extraction, specific structures such as the SPPCSPC module play an important role in feature fusion, capable of integrating feature information at different levels. The image undergoes multiple processing and downsampling operations, such as reducing the feature dimension through operations like Maxpool, thereby gradually forming feature maps of different scales. Subsequently, these features enter the head part for further analysis and processing, and finally, the prediction results are output in the form of three tensors of specific sizes.
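To make the basic building block concrete, the following is a minimal PyTorch sketch of a CBS unit (convolution, batch normalization, and SiLU activation) as described above; the class name, channel counts, and kernel settings are illustrative assumptions rather than the exact YOLOv7 configuration.

import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic unit referred to as CBS above."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2  # keep the spatial size when stride = 1
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# A 640 x 640 x 3 input, as used by YOLOv7, passed through one CBS block
x = torch.randn(1, 3, 640, 640)
y = CBS(3, 32)(x)  # output shape: (1, 32, 640, 640)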

2.2. Multi-Scale Information Fusion Module (MIFM)

There are many complex challenges in target detection, especially underwater target image recognition. The particularity of the underwater environment makes it difficult for traditional target detection methods to meet the demand for accurate recognition. We introduce the Multi-Scale Information Fusion Module (MIFM) to address these challenges effectively.
The details of MIFM are shown in Figure 2. This module first expands the feature channels through two 1 × 1 convolution operations, with an expansion ratio of γ = 2, to increase the feature dimension and provide richer information for subsequent processing. Then, the input features are divided into two parallel paths for processing. One path involves a gating mechanism, and the element-wise product of the features of the two paths enhances the nonlinear transformation, enabling the module to better capture the complex relationships between features. In the lower path, depthwise convolution is used for feature extraction, which extracts features effectively while reducing the computational cost. The module uses two 3 × 3 dilated convolutions with dilation rates of 2 and 3, respectively, achieving multi-scale feature extraction and improving its adaptability to targets of different sizes.
Given the input tensor X ∈ R^{H×W×C}, where H is the height, W the width, and C the number of channels, MIFM is formulated as:
\mathrm{Up}(X) = W_{1\times1}\left(\varphi\left(W_{3\times3}\left(W_{1\times1}(X)\right)\right) \,\Theta\, \left(W_{3\times3}^{2}\left(W_{1\times1}(X)\right) + W_{3\times3}^{3}\left(W_{1\times1}(X)\right)\right)\right)
\mathrm{Down}(X) = W_{1\times1}(X) + R\left(\mathrm{GN}\left(W_{3\times3}(X)\right)\right)
X_{\mathrm{out}} = \mathrm{Concat}\left(\mathrm{Up}(X), \mathrm{Down}(X)\right)
Θ represents element-wise multiplication, φ represents the GELU nonlinearity, W_{3×3}^{2} represents a 3 × 3 dilated convolution with a dilation rate of 2, and W_{3×3}^{3} represents a 3 × 3 dilated convolution with a dilation rate of 3. W represents convolution, with the subscript indicating the kernel size. R represents the Leaky ReLU activation function, and GN represents Group Normalization.
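The following is a minimal PyTorch sketch of the two-path computation defined by the equations above (1 × 1 channel expansion with γ = 2, a gated upper path combining a 3 × 3 convolution with dilated convolutions at rates 2 and 3, and a lower path with a 3 × 3 convolution, Group Normalization, and Leaky ReLU). Module and variable names are illustrative, and details such as channel bookkeeping and whether the 1 × 1 projections are shared may differ from the authors' implementation.

import torch
import torch.nn as nn

class MIFM(nn.Module):
    """Sketch of the Multi-Scale Information Fusion Module (MIFM)."""
    def __init__(self, channels, gamma=2):
        super().__init__()
        hidden = channels * gamma  # expansion ratio gamma = 2
        # upper (gated) path
        self.up_in = nn.Conv2d(channels, hidden, 1)
        self.up_dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depthwise 3x3
        self.up_d2 = nn.Conv2d(hidden, hidden, 3, padding=2, dilation=2)     # dilation rate 2
        self.up_d3 = nn.Conv2d(hidden, hidden, 3, padding=3, dilation=3)     # dilation rate 3
        self.up_out = nn.Conv2d(hidden, channels, 1)
        self.gelu = nn.GELU()
        # lower path
        self.down_pw = nn.Conv2d(channels, channels, 1)
        self.down_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.gn = nn.GroupNorm(8, channels)
        self.lrelu = nn.LeakyReLU(0.1)

    def forward(self, x):
        t = self.up_in(x)
        gate = self.gelu(self.up_dw(t))        # nonlinear (gating) branch
        multi = self.up_d2(t) + self.up_d3(t)  # multi-scale dilated branch
        up = self.up_out(gate * multi)         # element-wise gating, then 1x1 projection
        down = self.down_pw(x) + self.lrelu(self.gn(self.down_conv(x)))
        return torch.cat([up, down], dim=1)    # Concat(Up(X), Down(X))

out = MIFM(64)(torch.randn(1, 64, 20, 20))     # -> shape (1, 128, 20, 20)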
The Multi-Scale Information Fusion Module (MIFM) has significant advantages for underwater target detection. Compared with the original SPPCSPC module of YOLOv7, MIFM fuses information of different scales more effectively and adapts better to the various underwater target sizes and complex environments. It can automatically adjust its attention to targets of different scales according to the actual situation, allocating resources intelligently, whereas SPPCSPC is less flexible in this respect. In addition, MIFM introduces operations such as gating mechanisms, depthwise convolution, and multi-scale dilated convolution, which capture the complex relationships between features better, achieve more powerful nonlinear transformations, and enhance feature extraction for targets of different sizes. In contrast, the feature processing in SPPCSPC is comparatively simple. In underwater target detection tasks, MIFM significantly improves the accuracy of sonar image target recognition, reduces missed detections and false detections, and copes better with the challenges of the complex underwater environment.

2.3. RFAConv

In target detection, and especially in underwater target image recognition tasks, continuously exploring more effective feature extraction and fusion methods is crucial. To further enhance model performance and better adapt to the complex underwater environment, we introduce a new type of convolution structure, RFAConv (Receptive-Field Attention Convolution). Traditional convolution methods may have certain limitations when dealing with underwater target detection problems, such as insufficient extraction of multi-scale features and difficulty capturing the complex features of targets. The replacement with RFAConv therefore aims to overcome these problems and improve underwater target detection performance.
The structural diagram of RFAConv (Receptive-Field Attention Convolution) is depicted in Figure 3. RFAConv is crucial for target detection, enhancing network performance by learning an attention map through interaction with receptive field feature information. To mitigate the additional computational burden caused by this interaction, RFAConv employs AvgPool to pool the global information of each receptive field feature. It then facilitates information exchange through 1 × 1 group convolution operations and utilizes softmax to highlight the significance of individual features within the receptive field. Additionally, receptive field attention (RFA) is applied to the spatial features of the receptive field. This approach not only emphasizes the importance of various features within the receptive field but also considers its spatial characteristics, effectively addressing the issue of convolution kernel parameter sharing. The receptive field spatial features are dynamically generated, and RFA forms a fixed combination with convolution, with both elements working interdependently to boost performance. In essence, RFAConv's unique design substantially improves target detection network performance while efficiently managing computational overhead and parameter count.
The calculation of RFAConv can be expressed as:
F = \mathrm{Softmax}\left(g^{1\times1}\left(\mathrm{AvgPool}(X)\right)\right) \times \mathrm{ReLU}\left(\mathrm{Norm}\left(g^{k\times k}(X)\right)\right),
where g represents group convolution, and the superscript indicates the size of the convolution kernel. X represents the input feature map, and the output F is obtained by multiplying the attention map with the transformed receptive field spatial features. Softmax and ReLU represent activation functions, AvgPool represents the average pooling operation, and Norm is the normalization operation.
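A minimal PyTorch sketch of the computation in the formula above follows: average pooling summarizes each receptive field, a 1 × 1 group convolution exchanges information and produces softmax attention over the k × k receptive-field positions, and the attention-weighted receptive-field features are then fused. The final fusion step is simplified to a 1 × 1 convolution here, and all names are illustrative rather than the published RFAConv implementation.

import torch
import torch.nn as nn

class RFAConv(nn.Module):
    """Sketch of receptive-field attention convolution for a k x k kernel."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.k = k
        # attention branch: pool global info of each receptive field, then 1x1 group conv
        self.get_weight = nn.Sequential(
            nn.AvgPool2d(kernel_size=k, stride=stride, padding=k // 2),
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=1, groups=in_ch),
        )
        # feature branch: expand every receptive field into k*k spatial features
        self.get_feature = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=k, stride=stride,
                      padding=k // 2, groups=in_ch),
            nn.BatchNorm2d(in_ch * k * k),
            nn.ReLU(inplace=True),
        )
        # simplified fusion of the attention-weighted receptive-field features
        self.conv = nn.Conv2d(in_ch * k * k, out_ch, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        weight = self.get_weight(x).view(b, c, self.k * self.k, h, w).softmax(dim=2)
        feature = self.get_feature(x).view(b, c, self.k * self.k, h, w)
        out = (weight * feature).view(b, c * self.k * self.k, h, w)
        return self.conv(out)

y = RFAConv(64, 128)(torch.randn(1, 64, 40, 40))  # -> shape (1, 128, 40, 40)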
Selection and position arrangement of convolution structures are crucial in exploring improvements in target detection models. Different results are produced when RFAConv replaces the CBS (Convolution-BatchNorm-SiLU) structure at different positions. When replacing other CBSs, the effect worsens, while the best effect is achieved when it is placed at the CBS position immediately following the ELAN module in YOLOv7, replacing the Conv with RFAConv.
The reasons for this phenomenon mainly lie in two aspects. On the one hand, the ELAN module plays a key role in feature extraction, and its output features have specific patterns and characteristics. The CBS position immediately following it is crucial for further processing and optimizing these features. The unique recurrent feature aggregation method of RFAConv can better adapt to the characteristics of the output features of the ELAN module, achieving more effective feature fusion and extraction. The CBS at other positions may receive input features that do not match the characteristics of the RFAConv, be unable to exert its advantages fully, and even introduce inappropriate processing, resulting in a worse effect. On the other hand, at this specific position after the ELAN module, RFAConv can collaborate better with the preceding and subsequent modules, better undertake and integrate the features extracted by the upstream modules, and provide a higher-quality feature representation for subsequent processing. Due to the different interaction methods between the modules, a good collaboration effect cannot be achieved at other positions, thereby affecting the model’s overall performance.
In summary, placing RFAConv at the CBS position immediately following the ELAN module in YOLOv7 is well-considered. This position can give full play to the advantages of RFAConv, form good coordination with the surrounding modules, significantly enhance the model’s ability to extract features of underwater targets, reduce the occurrence of missed detections and false detections, enhance the robustness and adaptability of the model, improve the accuracy and stability of target detection, and thereby achieve better underwater target detection performance.

2.4. Spatial and Channel Synergistic Attention (SCSA) Module

Attention mechanisms are increasingly important in target detection and computer vision. They can help the model focus more precisely on key information, thereby enhancing the features’ expression ability and the model’s performance. However, current attention mechanisms still have certain limitations in dealing with multi-semantic information and spatial–channel collaboration.
We introduce a brand-new attention mechanism, SCSA (Spatial and Channel Synergistic Attention), to overcome these drawbacks and fully explore the collaboration between spatial and channel attention. SCSA aims to achieve more accurate feature extraction by efficiently integrating multi-semantic spatial information and channel information to obtain better model performance. Next, we will elaborate on the specific structure and working principle of SCSA in detail.
SCSA (Spatial and Channel Synergistic Attention) is a new type of attention mechanism. Its structure is shown in Figure 4. This attention mechanism aims to explore the collaboration between spatial and channel attention. It mainly comprises Shared Multi-Semantic Spatial Attention (SMSA) and Progressive Channel Self-Attention (PCSA).
Shared Multi-Semantic Spatial Attention (SMSA) module:
  1. Spatial and Channel Decomposition: We decompose the given input X ∈ R^{B×C×H×W} along the height and width dimensions. Global average pooling is then applied to each of these dimensions, yielding two unidirectional 1D sequence structures, namely X_H ∈ R^{B×C×W} and X_W ∈ R^{B×C×H}. To capture different spatial distributions and contextual relationships, we divide the feature set into K equally sized, mutually independent sub-features, named X_H^i and X_W^i, each with a channel count of C/K. In this paper, the default value is K = 4. The procedure for decomposing into sub-features is as follows:
X_{H}^{i} = X_{H}\left[:, (i-1) \times \tfrac{C}{K} : i \times \tfrac{C}{K}, :\right]
X_{W}^{i} = X_{W}\left[:, (i-1) \times \tfrac{C}{K} : i \times \tfrac{C}{K}, :\right]
where X^{i} represents the i-th sub-feature.
  2. Efficient Convolutional Approach: Implement separable one-dimensional convolutions with filter sizes of 3, 5, 7, and 9 across the four sub-features to detect various semantic spatial patterns. Concurrently, employ efficient shared convolutions for alignment to tackle the restricted receptive field issue that results from splitting features into the H and W dimensions and utilizing 1D convolutions. The process of obtaining diverse semantic spatial information is defined as follows:
\tilde{X}_{H}^{i} = \mathrm{DWConv1d}_{k_i}^{\frac{C}{K} \rightarrow \frac{C}{K}}\left(X_{H}^{i}\right)
\tilde{X}_{W}^{i} = \mathrm{DWConv1d}_{k_i}^{\frac{C}{K} \rightarrow \frac{C}{K}}\left(X_{W}^{i}\right)
where X̃^{i} represents the spatial structure information obtained from the i-th sub-feature after the lightweight convolution operation, and k_i represents the convolution kernel size applied to the i-th sub-feature.
  3. Computing the Spatial Attention Map: Aggregate the different semantic sub-features, normalize them using group normalization (GN) with K groups, and then generate the spatial attention through the Sigmoid activation function. The output features are calculated as:
\mathrm{Attn}_{H} = \sigma\left(\mathrm{GN}_{H}^{K}\left(\mathrm{Concat}\left(\tilde{X}_{H}^{1}, \tilde{X}_{H}^{2}, \ldots, \tilde{X}_{H}^{K}\right)\right)\right),
\mathrm{Attn}_{W} = \sigma\left(\mathrm{GN}_{W}^{K}\left(\mathrm{Concat}\left(\tilde{X}_{W}^{1}, \tilde{X}_{W}^{2}, \ldots, \tilde{X}_{W}^{K}\right)\right)\right),
where σ represents the Sigmoid function, and GN_H^K and GN_W^K represent Group Normalization with K groups along the H and W dimensions, respectively.
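A minimal PyTorch sketch of the SMSA steps above (directional global average pooling, splitting into K = 4 sub-features, depthwise 1D convolutions with kernel sizes 3, 5, 7, and 9, Group Normalization, and Sigmoid gating) is given below; how the two directional attention maps are applied back to the feature map is an assumption here, and the names are illustrative.

import torch
import torch.nn as nn

class SMSA(nn.Module):
    """Sketch of Shared Multi-Semantic Spatial Attention with K = 4 sub-features."""
    def __init__(self, channels, kernels=(3, 5, 7, 9), groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        sub = channels // groups
        # one depthwise 1D convolution per sub-feature, with increasing kernel size,
        # shared between the H and W directions
        self.dwconvs = nn.ModuleList(
            [nn.Conv1d(sub, sub, k, padding=k // 2, groups=sub) for k in kernels]
        )
        self.gn_h = nn.GroupNorm(groups, channels)
        self.gn_w = nn.GroupNorm(groups, channels)

    def _directional_attn(self, seq, gn):
        # seq: (B, C, L) sequence pooled along one spatial axis
        subs = torch.chunk(seq, self.groups, dim=1)        # K sub-features of C/K channels
        feats = [conv(s) for conv, s in zip(self.dwconvs, subs)]
        return torch.sigmoid(gn(torch.cat(feats, dim=1)))  # (B, C, L) attention

    def forward(self, x):
        # global average pooling along width and height gives two 1D sequences
        attn_h = self._directional_attn(x.mean(dim=3), self.gn_h)  # length H
        attn_w = self._directional_attn(x.mean(dim=2), self.gn_w)  # length W
        return x * attn_h.unsqueeze(-1) * attn_w.unsqueeze(-2)

out = SMSA(64)(torch.randn(1, 64, 32, 32))  # -> shape (1, 64, 32, 32)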
Progressive Channel-wise Self-Attention (PCSA) module:
The PCSA module explores the dependencies between channels through convolution operations. Inspired by the way MHSA in ViT models the similarities between different tokens when computing spatial attention, we combine the spatial priors modulated by SMSA to compute the similarities between channels. A progressive compression method is adopted to preserve and utilize the multi-semantic spatial information extracted by SMSA and to reduce the computational cost of MHSA. The specific implementation is as follows:
X_{p} = \mathrm{Pool}_{(7,7)}^{(H,W) \rightarrow (H',W')}\left(X_{s}\right),
F_{\mathrm{proj}} = \mathrm{DWConv1d}_{(1,1)}^{C \rightarrow C},
Q = F_{\mathrm{proj}}^{Q}\left(X_{p}\right),
K = F_{\mathrm{proj}}^{K}\left(X_{p}\right),
V = F_{\mathrm{proj}}^{V}\left(X_{p}\right).
Here, Pool_{(7,7)}^{(H,W)→(H',W')} represents the pooling operation with a kernel size of 7 × 7, which adjusts the resolution from (H, W) to (H', W'), and F_proj represents the mapping function for generating the queries, keys, and values.
Collaboration effect: SCSA guides the learning of channel attention through spatial attention. SMSA extracts multi-semantic spatial information from each feature, providing precise spatial priors for channel attention calculation; PCSA refines the semantic understanding of local sub-features by using the overall feature map X, reducing the semantic differences caused by multi-scale convolutions in SMSA. The final constructed SCSA is:
\mathrm{SCSA}(X) = \mathrm{PCSA}\left(\mathrm{SMSA}(X)\right).
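The channel stage and the overall composition SCSA(X) = PCSA(SMSA(X)) can be sketched as follows, reusing the SMSA class from the previous listing; compressing to a 7 × 7 spatial resolution, using single-head channel attention, and gating the input with the resulting channel weights are simplifying assumptions rather than the exact published design.

import torch
import torch.nn as nn

class PCSA(nn.Module):
    """Sketch of Progressive Channel-wise Self-Attention over an SMSA-modulated map."""
    def __init__(self, channels, pooled=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pooled)  # (H, W) -> (7, 7)
        # depthwise 1x1 projections producing queries, keys, and values per channel
        self.q = nn.Conv1d(channels, channels, 1, groups=channels)
        self.k = nn.Conv1d(channels, channels, 1, groups=channels)
        self.v = nn.Conv1d(channels, channels, 1, groups=channels)

    def forward(self, x):
        b, c, h, w = x.shape
        xp = self.pool(x).flatten(2)                   # (B, C, 49) compressed tokens
        q, k, v = self.q(xp), self.k(xp), self.v(xp)
        attn = torch.softmax(q @ k.transpose(1, 2) / (xp.shape[-1] ** 0.5), dim=-1)  # (B, C, C)
        gate = torch.sigmoid((attn @ v).mean(dim=2))   # one weight per channel
        return x * gate.view(b, c, 1, 1)

class SCSA(nn.Module):
    """SCSA(X) = PCSA(SMSA(X))."""
    def __init__(self, channels):
        super().__init__()
        self.smsa = SMSA(channels)  # from the previous SMSA sketch
        self.pcsa = PCSA(channels)

    def forward(self, x):
        return self.pcsa(self.smsa(x))

y = SCSA(64)(torch.randn(1, 64, 32, 32))  # -> shape (1, 64, 32, 32)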
In the underwater sonar image target recognition task, introducing the SCSA (Spatial and Channel Synergistic Attention) attention mechanism is of great significance. The existing attention mechanisms have limitations in dealing with multi-semantic information and spatial-channel collaboration, while SCSA aims to overcome these limitations. Integrating multi-semantic spatial and channel information effectively can achieve more accurate feature extraction, thereby improving the model performance. For underwater sonar image target recognition, the SCSA mechanism offers several benefits. It enhances the model’s capacity to concentrate on crucial information, boosts feature representation capabilities, and improves the detection and identification of underwater targets. At the same time, it can effectively integrate multi-semantic spatial information, helping the model to learn higher-quality feature representations and better cope with problems such as various sizes of underwater targets and complex environments. In addition, through the collaboration of space and channel, SCSA can extract target features more accurately and reduce the occurrence of missed and false detections.
Adding the SCSA mechanism at the three connection positions—where the backbone connects to the head—enhances the model’s ability to pay attention to and extract features and improve the model’s performance in tasks such as target detection and recognition. The SCSA can effectively integrate multi-semantic spatial information and channel information. Through the collaboration of space and channel, it can better focus on key information and improve the ability to express features. Introducing SCSA at these three connection positions enables the model to better learn and utilize features at different stages and enhance the attention to different scales and semantic information, thereby improving the recognition accuracy of underwater sonar image targets. Incorporating SCSA at these locations can yield several benefits. It can enhance the critical distribution of features, prompting the model to focus more on essential characteristics and improve feature utilization. Additionally, it can enhance feature fusion, allowing for better integration of features from various levels and scales, thereby boosting feature expressiveness and resilience. Furthermore, it can enhance the model’s generalization capabilities by learning more representative features, mitigating overfitting risks, and improving the model’s adaptability across diverse datasets and tasks.

2.5. MFF-YOLOv7

In the field of underwater target image recognition, the conventional YOLOv7 model exhibits significant limitations when confronted with challenges such as complex underwater environments, diverse target sizes, abundant interfering information in sonar images, low imaging resolution, and small, densely clustered targets. Consequently, it is susceptible to missed detections and false detections. To address these issues, we propose the MFF-YOLOv7 model, which aims to enhance the performance and accuracy of underwater target image recognition. Figure 5 illustrates the structural diagram of MFF-YOLOv7, demonstrating its distinctive design.
The traditional YOLOv7 model falls short when dealing with complex underwater environments and widely varying target sizes. To solve this problem, the MFF-YOLOv7 model introduces the original Multi-Scale Information Fusion Module (MIFM) to replace SPPCSPC. MIFM can fuse multi-scale information more effectively and enhance the model's ability to process features of different scales. Applying the Multi-Scale Information Fusion Module (MIFM) enables the model to accurately identify targets of various sizes in complex underwater scenes. It effectively solves the problem that the traditional module struggles to process features of different scales because of the large size differences among underwater targets. By introducing MIFM, our model can better adapt to the diversity of underwater environments and improve its ability to detect and recognize targets.
Considering the characteristics of sonar images, such as high noise and unclearness, the existing feature extraction methods need to be improved when dealing with such images. Therefore, we replace the Conv in the CBS immediately following ELAN with RFAConv. RFAConv has a better feature extraction ability and is more adaptable to specific types of sonar image data. It can significantly improve the model’s learning and representation of sonar image features, enabling it to extract useful target features from the noise better. This improvement helps to enhance the model’s performance when processing low-quality sonar images and reduces missed and false detections caused by image quality issues.
In underwater target recognition, sonar images often contain a large amount of interfering information, which makes the model prone to being disturbed by irrelevant information, thereby affecting recognition accuracy. To solve this problem, we introduce the SCSA mechanism at the three connection positions where the backbone connects to the head. The SCSA mechanism helps the model pay more attention to important feature information and reduces the interference of irrelevant information. By enabling the model to focus better on the features related to target recognition when transferring features from the backbone to the head, the SCSA mechanism can significantly improve the model's recognition accuracy and enhance its robustness in complex underwater environments.
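Conceptually, the change amounts to wrapping each of the three feature maps handed from the backbone to the head in an SCSA block, as in the short sketch below, which reuses the SCSA class sketched in Section 2.4; the three channel widths are placeholder values, not the actual YOLOv7 configuration.

import torch.nn as nn

class BackboneToHeadBridge(nn.Module):
    """Applies SCSA to each of the three backbone outputs before they reach the head."""
    def __init__(self, channels=(512, 1024, 1024)):  # placeholder channel widths
        super().__init__()
        self.scsa_blocks = nn.ModuleList([SCSA(c) for c in channels])

    def forward(self, feats):
        # feats: the three multi-scale feature maps produced by the backbone
        return [blk(f) for blk, f in zip(self.scsa_blocks, feats)]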
The MFF-YOLOv7 model has successfully solved the problems faced by the traditional YOLOv7 model in underwater target image recognition through innovative improvements such as introducing MIFM, replacing Conv with RFAConv, and introducing the SCSA mechanism. These improvements have significantly enhanced the model’s ability to fuse multi-scale information, extract features of sonar images, and pay attention to important information, thereby greatly improving the recognition accuracy and robustness of the model. Our improvements provide a more effective solution for underwater target image recognition and have the potential to achieve better results in practical applications.

3. Experimental Design and Experimental Analysis

This section demonstrates the effectiveness and generalization of the proposed method through underwater sonar image target detection experiments. In this section, we briefly introduce the datasets, evaluation metrics, and experimental settings. Subsequently, a comparison experiment of attention mechanisms is conducted to demonstrate the advantages of the SCSA mechanism. Next, ablation experiments are carried out to verify the effectiveness of each improvement. Then, the improved algorithm is compared with other mainstream algorithms to demonstrate its superiority. Finally, additional datasets are used to verify the generalization of the improved algorithm.

3.1. Experimental Environment

To verify the model’s effectiveness, we conducted comparative experiments on the URPC, SCTD and UATD datasets to verify the detection effect of the model. The operating system is Windows 11, the deep learning framework is PyTorch 1.4.0, the CPU is Intel Core i7 12700H, the memory is 32 GB, and the GPU is NVIDIA GeForce GTX 3070TI.

3.2. Experimental Indicators

The assessment criteria were Precision, Recall, and mAP, computed using the following formulas:
\mathrm{Precision} = \frac{TP}{TP + FP},
\mathrm{Recall} = \frac{TP}{TP + FN}.
Here, T P refers to the number of samples that are correctly judged as positive examples. It represents the situation of successfully identifying the target category during the prediction or classification process. F P refers to the number of samples that are wrongly judged as positive examples. It indicates the situation where the model wrongly classifies negative examples as positive ones. F N refers to the number of samples that are wrongly judged as negative examples. It represents the situation where the model wrongly classifies positive examples as negative ones and misses them.
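As a small illustration of how these counts translate into the two metrics, consider the following Python snippet; the counts are made-up numbers for demonstration only.

def precision_recall(tp, fp, fn):
    """Compute Precision and Recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g., 85 correct detections, 15 false alarms, 31 missed targets
p, r = precision_recall(tp=85, fp=15, fn=31)
print(f"Precision = {p:.3f}, Recall = {r:.3f}")  # Precision = 0.850, Recall = 0.733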
mAP (mean average precision) is an indicator of recognition accuracy in target recognition. When there are multiple classes to be detected or classified, each class has its own average precision (AP). mAP is the mean of the average precisions of all these classes. mAP provides a comprehensive metric to measure the performance of a multi-class model and can comprehensively reflect the accuracy of the model across different classes.
AP = \int_{0}^{1} P(r)\,dr,
mAP = \frac{\sum_{n=1}^{N} AP_{n}}{N},
where P represents precision, r represents recall, P is regarded as a function of r, N is the number of object categories, and AP_n represents the average precision of the neural network when recognizing the n-th category of targets. mAP has two common evaluation variants, namely mAP@0.5 and mAP@0.5:0.95. mAP@0.5 represents the average precision when the Intersection over Union (IoU) threshold is 0.5 and is an evaluation metric with relatively low detection requirements. mAP@0.5:0.95 is the average precision calculated at multiple IoU thresholds (starting from 0.5 and increasing in steps of 0.05 up to 0.95), representing an evaluation metric with higher detection requirements; it can evaluate the performance of the model more comprehensively under different precision requirements. The above indicators are used to measure the accuracy of the neural network.
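The following sketch shows how AP can be approximated by numerically integrating the precision-recall curve with the trapezoidal rule and how mAP averages the per-class APs; the sample curve values are purely illustrative.

def average_precision(recall, precision):
    """Approximate AP as the area under the precision-recall curve (trapezoidal rule)."""
    pairs = sorted(zip(recall, precision))  # integrate over increasing recall
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pairs, pairs[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2.0
    return ap

# illustrative per-class PR samples; mAP is the mean of the per-class APs
ap_per_class = [
    average_precision([0.0, 0.4, 0.7, 1.0], [1.0, 0.9, 0.8, 0.5]),
    average_precision([0.0, 0.5, 0.8, 1.0], [1.0, 0.85, 0.7, 0.4]),
]
map_value = sum(ap_per_class) / len(ap_per_class)
print(f"mAP = {map_value:.3f}")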
Intersection over Union (IoU) is an important metric widely used in fields such as computer vision to measure the degree of overlap between two sets (usually two regions represented by bounding boxes in images). It is calculated by dividing the intersection (i.e., the overlapping part) of the two sets by the union (all included parts), and the calculation formula is:
\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}.
Its value ranges between 0 and 1. A value of 0 indicates no overlap between the two sets, such as two detection boxes corresponding to different objects; a value of 1 indicates complete coincidence, that is, the detection boxes have exactly the same position and correspond to the same object. In object detection, IoU is mainly used to evaluate how well the predicted bounding box matches the ground-truth bounding box. By setting a threshold (such as 0.5), a predicted box whose IoU with the ground truth exceeds the threshold is considered an accurate detection of the target; otherwise, it may be a wrong or inaccurate detection.
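For axis-aligned bounding boxes given as (x1, y1, x2, y2), IoU can be computed as in the short sketch below; the example boxes are illustrative.

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# a predicted box counts as a correct detection if IoU with the ground truth exceeds 0.5
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143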

3.3. Experimental Results and Analysis of the URPC Dataset

The URPC dataset encompasses 10,875 images, which are categorized into four subsets: sea urchin, sea cucumber, scallop, and starfish. The distribution of samples among these subsets is highly unbalanced. As shown in Figure 6a, the category statistics chart reveals that the sea urchin subset has the largest number of samples, followed by the starfish and sea cucumber subsets, with the scallop subset having the fewest.
This dataset is divided into a training set and a test set in an 8:2 ratio to support the training and testing of the proposed algorithm. After this division, 8700 images are allocated for training and 2175 for testing. However, to further improve the model’s performance and generalization ability, a validation set is also required. The division into training, validation, and test subsets is based on the following considerations. Firstly, to maintain the representativeness of each subset, the samples are divided proportionally according to the original distribution of each category. Secondly, in order to ensure that the model can be effectively evaluated and fine-tuned, a sufficient number of samples are reserved for the validation set.
The dataset showcases a range of intricate scenarios, including visual obstruction caused by clustered underwater organisms, variations in illumination, and image distortion resulting from motion capture. These challenging conditions accurately represent the underwater environment and improve the model’s ability to generalize. The box plot indicates that the dimensions of the target boxes are relatively uniform. The normalized target location map suggests that targets are primarily concentrated horizontally but more dispersed vertically. The standardized target size map reveals that target dimensions are fairly consistent, with most being small in size. Sample images from the URPC dataset are displayed in Figure 6b.

3.3.1. Attention Mechanism Comparison Experiments

Conducting the attention comparison experiment on the URPC dataset is crucial for in-depth research and evaluation of the performance of different attention mechanisms. This experiment focuses on the attention comparison for YOLOv7, aiming to demonstrate the advantages of the SCSA mechanism through detailed data and results.
It can be seen from Table 1 that APechinus, APstarfish, APholothurian, and APscallop represent the average precision for the different underwater target categories. YOLOv7-SCSA performs well on all four indicators. Taking APechinus as an example, its value of 86.0% is higher than the 85.0% of models such as YOLOv7, showing that the SCSA mechanism can extract features more precisely when handling sea urchin targets.
For APstarfish, the value of YOLOv7-SCSA is also relatively high, indicating that the mechanism is likewise effective for starfish detection. For APholothurian and APscallop, SCSA also shows advantages, indicating that it adapts to different types of underwater targets and can extract effective features for the characteristics of each target type.
mAP@0.5 and mAP@0.5:0.95 reflect the comprehensive detection capability of a model under different Intersection over Union (IoU) thresholds. YOLOv7-SCSA achieves 74.8% in mAP@0.5, higher than the other models, implying that under relatively lenient detection requirements, SCSA enhances the model's overall detection of various targets. Moreover, its mAP@0.5:0.95 reaches 43.5%, significantly better than the other models. Since this metric considers multiple IoU thresholds, it demonstrates that SCSA performs well under different precision requirements, has strong generalization ability, and can adapt to various complex detection scenarios, thus providing powerful support for a comprehensive evaluation of the model's performance.
Precision is the proportion of samples that are truly positive among those predicted as positive. The precision of YOLOv7-SCSA is 84.9%, higher than that of the other models, indicating that during detection the SCSA mechanism identifies targets more accurately and reduces false positives. It effectively focuses on key information and makes more precise judgments about targets, enhancing the accuracy and reliability of the model. Such high precision is crucial for underwater target image recognition, especially when different target types must be distinguished accurately, because SCSA provides the model with a more reliable basis for decision-making.
Recall is the proportion of positive samples that are correctly predicted as positive. The recall of YOLOv7-SCSA is 68.9%, higher than that of the other models, indicating that the SCSA mechanism detects targets more completely and reduces missed detections. In underwater target image recognition, missed detections are a common problem owing to the complex environment and the diversity of targets. By focusing better on crucial information and improving feature expression, SCSA effectively enhances the model's coverage of targets, ensuring that more targets are correctly detected and thereby improving the practicability and effectiveness of the model.
The SCSA mechanism can better focus on essential information, improve the feature expression ability, and extract target features more accurately by effectively integrating multi-semantic spatial information and channel information, thereby significantly improving the performance of the YOLOv7 model in the underwater target image recognition task. This conclusion is consistent with the theoretical expectations of the attention mechanism and provides a more effective solution for underwater target image recognition.

3.3.2. Ablation Experiment

In underwater target image recognition research, to deeply explore the influence of different modules on model performance, we conduct module ablation experiments on the URPC dataset. This experiment aims to evaluate the importance and contribution of each module in the underwater target recognition task by purposefully removing or replacing specific modules in the model. Through rigorous module ablation experiments on the URPC dataset, we expect to understand the mechanism of action of each module more accurately, thereby providing a solid basis for further optimizing model performance.
Table 2 of the ablation experiment clearly shows the importance of the different modules for the YOLOv7 model in underwater target image recognition. The base model (YOLOv7), with no new module added, has an accuracy of 82.1%, a recall of 66.9%, a mAP@0.5 of 72.1%, and a mAP@0.5:0.95 of 42.6%, providing a baseline for subsequent comparisons. When only the RFAConv module is added, the accuracy increases to 85.1% and the mAP@0.5 rises to 73.1%, indicating that this module, with its better feature extraction ability, plays a positive role in dealing with the high noise and unclearness of sonar images and improves the model's ability to learn and represent target features. When both the RFAConv and SCSA modules are added, the accuracy further improves to 88.2%, and all indicators improve significantly, highlighting the importance of the SCSA mechanism in reducing the interference of irrelevant information and enabling the model to focus on target features.
Finally, when RFAConv, SCSA, and MIFM are all added, the model performance is at its best: the accuracy is 89.9%, the recall is 73.0%, the mAP@0.5 is 79.1%, and the mAP@0.5:0.95 is 45.0%. The MIFM solves the problem of significant differences in underwater target sizes by fusing multi-scale information more effectively. The three modules work together to make the model perform outstandingly in the underwater target image recognition task. The higher precision, better mAP values at different thresholds, and other improvements fully demonstrate the respective advantages of RFAConv, SCSA, and MIFM and their collaborative effect. This provides a clear direction and a solid basis for further optimizing the underwater target image recognition model.
As shown in Figure 7, comparing the two confusion matrices shows that MFF-YOLOv7 has advantages in several respects. First, the accuracy of MFF-YOLOv7 has generally improved for every target category; for example, the accuracy of the "echinus" category has increased from 0.84 to 0.9. Second, the added modules help to reduce the misclassification rate. The RFAConv module improves the feature extraction ability, enabling the model to handle the high noise and unclearness of sonar images better; the SCSA mechanism focuses on important feature information and reduces the interference of irrelevant information, improving recognition accuracy especially when sonar images contain a large amount of interfering information; and the MIFM, by fusing multi-scale information more effectively, solves the problem that the traditional module struggles to process features of different scales owing to the significant size differences among underwater targets. These advantages make MFF-YOLOv7 more accurate and reliable in the underwater target image recognition task, providing a more effective model choice for research in this field.
Figure 8a shows the Precision-Recall (PR) curve of YOLOv7, which exhibits different characteristics across categories. The precision value for the "echinus" sea urchin target is 0.850, indicating relatively high detection precision at different recall levels. The precision for the "starfish" target is 0.896, showing good detection performance. However, the precision for the "scallop" target is only 0.354, indicating that detection of this type of target needs improvement. The precision for the "holothurian" sea cucumber target is 0.785, a medium level. Considering all categories, the average precision is 0.721. Overall, the precision shows a downward trend as the recall increases, and the specific numerical changes in different recall intervals require further analysis to better understand the performance of YOLOv7 in various situations.
Figure 8b shows that the PR curve of MFF-YOLOv7 improves in all categories. The precision for the "echinus" sea urchin target increases to 0.927, significantly higher than that of YOLOv7. The precision for the "starfish" target reaches 0.993, approaching perfect detection precision. The precision for the "scallop" target is 0.397; although still relatively low, it represents clear progress compared with YOLOv7. The precision for the "holothurian" sea cucumber target is 0.847, a noticeable improvement. Overall, the average precision increases to 0.791. The PR curve of MFF-YOLOv7 lies closer to the upper right corner, meaning that it has higher precision at the same recall or higher recall at the same precision, showing its advantage in the underwater target image recognition task.
Overall, MFF-YOLOv7 is superior to YOLOv7 in all categories and in the comprehensive average precision, fully demonstrating the improvement brought by the three added modules (RFAConv, SCSA, and MIFM). The magnitude of the precision improvement varies across categories: the "starfish" category improves substantially, while the "scallop" category improves but remains relatively low, suggesting that subsequent research should further optimize the model for scallop targets. The shape and position of the PR curve show that MFF-YOLOv7 can maintain high precision over a broader range of recall values, which is valuable for balancing detection precision and recall in practical applications. In summary, analyzing the PR curves of the two models clarifies the performance advantages of MFF-YOLOv7 with the three added modules in the underwater target image recognition task, providing an essential basis for further improving and optimizing the model.
Figure 9 presents a visual comparison of the performance of YOLOv7 and MFF-YOLOv7 in multiple underwater scenes, which provides valuable insights into the capabilities of both models.
In the context of multi-target detection, YOLOv7 exhibits several limitations. As shown in the figure, in complex underwater environments, it often struggles to accurately detect and distinguish between multiple targets. Small-sized and occluded targets pose particular challenges for YOLOv7. The model is prone to missed detections, where it fails to identify some of the targets present in the scene. Additionally, false detections occur, where it incorrectly identifies objects as targets. This is mainly due to the complex nature of underwater sonar images, which have low resolution and contain various interfering factors. The indistinct target discrimination ability of YOLOv7 further exacerbates these issues, making it difficult to accurately classify different types of targets. Achieving a balanced detection accuracy and recall rate for different target types is also a challenge for YOLOv7 in such scenarios.
In contrast, MFF-YOLOv7 shows significant improvements in multi-target detection. The added RFAConv module enhances the model’s feature extraction capabilities, especially in the presence of high noise and unclear sonar images, which are typical in underwater environments. This allows the model to better capture the characteristics of targets. The SCSA mechanism focuses the model’s attention on the relevant features of each target, reducing the interference from irrelevant information and improving the discrimination between different targets. The MIFM effectively fuses multi-scale information, enabling the model to handle targets of various sizes, including small-sized and occluded ones. As a result, MFF-YOLOv7 can accurately detect a greater number of targets and achieve a more balanced detection accuracy and recall rate across different target types, thereby enhancing the overall multi-target detection performance.
For single-target detection, YOLOv7 is highly susceptible to the complex factors in the underwater environment. Changes in lighting and water turbidity can significantly affect its ability to extract meaningful features from a single target. As a consequence, it has difficulties in accurately identifying single targets with indistinct features. The low annotation probability of such targets further compounds the problem, as the model has less training data to learn from. This leads to unstable detection results, with the performance varying depending on the environmental conditions.
MFF-YOLOv7, on the other hand, demonstrates higher accuracy and stability in single-target detection. The SCSA mechanism enables the model to focus precisely on the features of a single target, effectively reducing the influence of environmental interference. This allows the model to accurately identify single targets even in challenging underwater conditions. For single targets with indistinct features, the combined effect of the RFAConv and MIFMs leads to more effective feature extraction. This, in turn, improves the detection success rate, as the model can better capture the subtle characteristics of the target.
The improved MFF-YOLOv7 offers several notable advantages. Firstly, it provides higher detection accuracy in both multi-target and single-target detection scenarios, effectively reducing false and missed detections. Secondly, it exhibits better adaptability to the complex environment of multiple underwater scenes, including variations in lighting, water turbidity, significant differences in target sizes, and occlusions. This adaptability makes it a more reliable solution for underwater target detection in real-world applications. Furthermore, it achieves a more balanced performance across different types of targets, which is crucial for comprehensive underwater target recognition. The overall improvement in detection performance not only provides more accurate results but also serves as a foundation for more advanced underwater imaging and sensing applications. Additionally, the stable detection results of MFF-YOLOv7 ensure the reliability of the data generated, which is essential for subsequent analysis and decision-making processes.
In summary, the visual evidence in Figure 9, along with the detailed analysis, clearly demonstrates that MFF-YOLOv7 is significantly superior to YOLOv7 in multi-target and single-target detection and recognition in multiple underwater scenes. This superiority validates the effectiveness of the proposed modifications and highlights the potential of MFF-YOLOv7 for various underwater target detection applications.

3.3.3. Comparison Experiments with Other Algorithms

In underwater target detection research, it is essential to compare the MFF-YOLOv7 model with other models. On the one hand, the underwater environment is complex and changeable; target forms and sizes vary, and they are often affected by factors such as lighting, water flow, and turbidity. Comparing MFF-YOLOv7 with other models clarifies its performance in dealing with these complex challenges and allows a better evaluation of its feasibility and reliability in practical applications. On the other hand, different target detection models adopt different algorithms and techniques. The advantages and disadvantages of the various methods can be understood in depth through comparison, providing directions for further improvement and optimization of MFF-YOLOv7.
As shown in Table 3, the performance of different target detection models is compared on the URPC dataset. YOLOv5s shows average performance across the indicators, with an accuracy of 80.4%, a recall of 65.5%, an mAP@0.5 of 68.9%, and an mAP@0.5:0.95 of 38.1%, leaving clear room for improvement. YOLOv5m improves on YOLOv5s, reaching an accuracy of 82.6%, but its overall performance is still limited. YOLOv7 performs in a relatively balanced way across the indicators, with an accuracy of 82.1%, a recall of 66.9%, an mAP@0.5 of 72.1%, and an mAP@0.5:0.95 of 42.6%, yet it still has potential for improvement. YOLOv7-Tiny achieves a comparatively high accuracy of 82.7%, but its recall of 63.6% and mAP@0.5:0.95 of 36.9% are relatively low. YOLOv7-SDBB performs well on some indicators, with an accuracy of 82.0%, a recall of 66.8%, an mAP@0.5 of 72.4%, and an mAP@0.5:0.95 of 43.4%, but it still trails the best-performing models. YOLOv8n shows weaker overall performance, with an accuracy of 80.1%, a recall of 64.8%, an mAP@0.5 of 68.6%, and an mAP@0.5:0.95 of 38.6%. YOLOv9 is comparable to YOLOv7 on several indicators, with an accuracy of 82.1%, a recall of 64.8%, an mAP@0.5 of 71.0%, and an mAP@0.5:0.95 of 42.0%, and likewise leaves room for further improvement.
Our MFF-YOLOv7 model performs outstandingly on every indicator of the URPC dataset. Its accuracy reaches 89.9%, significantly higher than that of the other models, so it identifies targets more accurately and produces fewer false alarms. Its recall of 73.0% exceeds that of most models, so fewer targets are missed. Its mAP@0.5 of 79.1% and mAP@0.5:0.95 of 45.0% lead all compared models, showing that the model maintains excellent performance across different Intersection over Union (IoU) thresholds and is therefore evaluated favorably under a more comprehensive range of conditions. Compared with the other target detection models on the URPC dataset, MFF-YOLOv7 shows clear advantages in accuracy, recall, and average precision, providing a more effective solution for underwater target detection tasks.
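For reference, the metrics quoted throughout this section follow the standard definitions below; they are restated here for clarity and are not reproduced from the original text.

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{mAP@0.5{:}0.95} = \frac{1}{10} \sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \mathrm{mAP@}t
```

where TP, FP, and FN denote true positives, false positives, and false negatives, and mAP@t is the mean average precision computed at IoU threshold t.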

3.4. Experimental Results and Analysis of the SCTD

The experimental data used in this study come from the Sonar Common Target Detection Dataset (SCTD) collected and organized by Zhou Yan, containing 596 images. The sonar targets in these images were labeled with the open-source annotation tool LabelImg. The annotated dataset includes three main types of targets: sunken ships (461 images), crashed aircraft (90 images), and human bodies (45 images). Random rotation and Gaussian blurring were applied to balance the original dataset and alleviate the problem of sample imbalance; after augmentation, the three target classes contain 512, 454, and 397 instances, respectively. The dataset is divided into training, validation, and test sets in the ratio of 7:1:2. Figure 10 shows schematic diagrams of these three typical sonar image targets.
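As an illustration of this preparation pipeline, the sketch below applies a random rotation and a Gaussian blur to a sonar image and performs a 7:1:2 split. It is only a plausible reconstruction: the rotation range, blur kernel sizes, and function names are our own choices and are not reported in the paper, and bounding-box coordinates would have to be transformed alongside the image.

```python
import random
import cv2
import numpy as np

def augment_sonar_image(img: np.ndarray) -> np.ndarray:
    """Offline augmentation in the spirit of the paper: random rotation plus Gaussian
    blur. The angle range and kernel sizes are illustrative, not reported values."""
    h, w = img.shape[:2]
    angle = random.uniform(-15.0, 15.0)
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    rotated = cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REFLECT)
    k = random.choice([3, 5])  # Gaussian kernel size must be odd
    return cv2.GaussianBlur(rotated, (k, k), 0)

def split_dataset(items, seed: int = 0):
    """7:1:2 train/validation/test split, matching the ratio used for the SCTD."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_train, n_val = int(0.7 * len(items)), int(0.1 * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```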

Model Generalizability Experiment

Accurately identifying targets in sonar images is crucial for a wide range of underwater applications, so the proposed model must be verified through extensive experiments in different scenarios. To this end, experiments on the SCTD were conducted to verify the model's generalization. The SCTD contains underwater sonar images covering a variety of environments and target types; evaluating the model on this dataset shows whether it can accurately identify and classify targets under complex and changeable real-world conditions, thereby verifying its generalization ability and supporting its reliability in practical applications.
Table 4 shows that our MFF-YOLOv7 model has obvious advantages and excellent generalization among various underwater target detection models.
Compared with the other models, the MFF-YOLOv7 model improves markedly on every indicator. In terms of APship (average precision of ship detection), APplane (average precision of aircraft detection), and APhuman (average precision of human detection), MFF-YOLOv7 reaches 96.3%, 99.9%, and 99.9%, respectively, far higher than the other models. In terms of mAP@0.5 (average precision at an IoU threshold of 0.5) and mAP@0.5:0.95 (average precision averaged over IoU thresholds from 0.5 to 0.95), it reaches 98.7% and 63.2%, respectively, also significantly ahead of the other models.
This indicates that the MFF-YOLOv7 model can more accurately detect various underwater targets, including ships, aircraft, and humans, and that it has stronger generalization ability and can adapt to different underwater target detection tasks. In contrast, models such as SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5, YOLOv7, and YOLOv8 perform comparatively poorly on these indicators. MFF-YOLOv7 therefore offers clear advantages and strong generalization in underwater target detection and can provide more reliable and accurate detection results for related applications.
MFF-YOLOv7 shows outstanding performance on the SCTD. The PR curve in Figure 11 shows that the average precision over all categories reaches 0.987 at an IoU threshold of 0.5, indicating that the model as a whole attains very high detection accuracy. Per category, the average precision for ships is 0.963, a good detection result, while the human and aircraft categories both reach 0.999, which is close to perfect detection.
Overall, the model detects all target types well, achieves a high overall average precision, and shows strong generalization ability and accuracy across different types of underwater targets. MFF-YOLOv7 thus performs outstandingly in underwater sonar image target detection and provides an efficient and reliable solution for this field.

3.5. Generalization Experiment

3.5.1. UATD Datasets

Performance verification on additional data is a key step in establishing the practical value of a model. To examine the model's behavior in complex situations more thoroughly, we selected the UATD dataset for further experiments.
The UATD dataset is considerable in size, consisting of 9000 sonar images. The data were collected with a Multi-Beam Forward-Looking Sonar (MFLS); the Tritech Gemini 1200ik sonar used offers high resolution and can switch its operating frequency between 720 kHz (long range) and 1200 kHz (short range) for targets at different distances.
One of the highlights of this dataset is the authenticity of its collection environment. Part of the data was collected at Jinshitan, Dalian, a shallow-water environment where the depth ranges from 4 to 10 m; another part was collected at Haoxin Lake, Maoming, with a maximum water depth of 4 m. Data collected in such real marine and shallow-water environments give the experiments both practical significance and added difficulty.
More importantly, the UATD dataset covers a rich and diverse range of object categories, including ten types: Cube, Ball, Cylinder, Human Body, Plane, Circle Cage, Square Cage, Metal Bucket, Tire, and ROV. These objects are imaged in a low-visibility underwater environment, so their sonar images are correspondingly complex. Because the UATD dataset combines low visibility with multiple object types, it is an ideal choice for verifying the model's behavior in complex situations, and experiments on it provide a valuable basis for evaluating the model's performance.
The dataset is already split into training, test, and validation sets of 7600, 800, and 800 sonar images, respectively.

3.5.2. Experimental Results and Analysis

Table 5 compares MFF-YOLOv7 with various deep learning models from the literature, all trained on the UATD dataset [27]. Among the compared models, MFF-YOLOv7 achieves the highest precision and remains highly competitive in recall and mAP@0.5.
In this comparison, the precision of MFF-YOLOv7 reaches 91.2%. This is a clear advantage over RetinaNet (63.2%) and Faster R-CNN (74.3%): a larger proportion of its predictions are true positives, so detection accuracy is high and false alarms are comparatively rare. MFF-YOLOv7 is also slightly better than YOLOv3SPP (91.1%), which likewise achieves high precision, reflecting its ability to identify targets accurately.
The recall of MFF-YOLOv7 is 88.9%. Compared with models such as RetinaNet (62.4%) and YOLOv8 (81.0%), it recovers a larger share of the actual positive examples, meaning that on the UATD dataset it is less likely to miss targets and covers the relevant targets more comprehensively.
The mAP@0.5 of MFF-YOLOv7 reaches 87.2%. This is a clear advantage over models such as RetinaNet (62.5%) and Faster R-CNN (75.1%), and it also exceeds other well-performing models such as YOLOv3 (79.1%) and YOLO-DCN (80.5%). The model therefore maintains good average precision across recall levels and, considering detection accuracy and recall together, completes the detection task stably and with high quality.
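The average precision behind these mAP figures can be computed with the standard all-point interpolation once detections have been matched to ground truth; the sketch below is a generic reference implementation under that assumption, not the evaluation code used by the authors.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one class at one IoU threshold.

    scores: confidence of each detection; is_tp: 1 if that detection matched a
    ground-truth box at the chosen IoU threshold, else 0; num_gt: number of
    ground-truth boxes. IoU matching is assumed to have been done beforehand.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    recall = np.cumsum(tp) / max(num_gt, 1)
    # Make the precision curve monotonically decreasing, then integrate over recall.
    mprec = np.concatenate(([0.0], precision, [0.0]))
    mrec = np.concatenate(([0.0], recall, [1.0]))
    for i in range(len(mprec) - 2, -1, -1):
        mprec[i] = max(mprec[i], mprec[i + 1])
    changed = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[changed + 1] - mrec[changed]) * mprec[changed + 1]))

# mAP@0.5 averages this AP over all classes at IoU 0.5; mAP@0.5:0.95 additionally
# averages it over IoU thresholds 0.50, 0.55, ..., 0.95.
print(average_precision([0.9, 0.8, 0.6], [1, 0, 1], num_gt=2))  # ~0.833
```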
Overall, the comparison with multiple object detection models on the UATD dataset shows that MFF-YOLOv7 performs strongly on the key indicators of precision, recall, and mean average precision, demonstrating its capability in the complex underwater detection scenarios that the UATD dataset represents.
Figure 12 shows the confusion matrix of the UATD dataset obtained by comparing the predicted labels with the actual labels.
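For readers reproducing Figure 12, the sketch below shows how such a detection confusion matrix can be assembled once each prediction has been matched to a ground-truth box (or to background) by IoU; the function name and the class indexing convention are our own assumptions.

```python
import numpy as np

def detection_confusion_matrix(pairs, num_classes):
    """Assemble a detection confusion matrix such as the one in Figure 12.

    pairs: iterable of (true_class, predicted_class) indices produced by IoU
    matching, where index num_classes stands for 'background' (a missed ground
    truth or an unmatched prediction). The matching step itself is assumed to
    have been performed already, as in standard YOLO-style evaluation.
    """
    cm = np.zeros((num_classes + 1, num_classes + 1), dtype=int)
    for true_cls, pred_cls in pairs:
        cm[pred_cls, true_cls] += 1   # rows: predicted class, columns: true class
    return cm

# Toy example with 2 classes: one correct "cube", one "ball" predicted as "cube",
# and one missed "ball" (assigned to background).
print(detection_confusion_matrix([(0, 0), (1, 0), (1, 2)], num_classes=2))
```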
Figure 13 shows the label map (Figure 13a) and the predicted label map (Figure 13b). In the label map, different targets such as “cube”, “ball”, “circle cage”, etc., are clearly marked. The predicted label map shows the prediction results of the model for these targets and the corresponding probability values.
In the single-target scenario, such as in sub-images where only “cube” or “ball” exists, the model has significant advantages. When the model predicts a single target, the prediction probability is often high. Taking “cube” as an example, when the sub-image in the label map is marked as “cube”, the corresponding predicted label in the predicted label map is “cube”, and the probability value is mostly above 0.8, which indicates that the model has high accuracy in identifying single targets and has a strong confidence in recognizing individual targets.
Furthermore, the model distinguishes different types of single targets well. The prediction probability for “ball” is also generally high, which means that in single-target situations the model can separate “cube” from “ball” reliably and rarely misjudges.
In the multi-target scenario, for example, when there are both “cube” and “circle cage”, or “ball” and “circle cage” in the sub-image, the model’s performance is satisfactory. It can be seen from the predicted label map that the model can accurately identify multiple targets one by one. For instance, in the sub-image with both “cube” and “circle cage”, the model can correctly label these two targets, and the probability value corresponding to each target is within a reasonable and acceptable range.
More importantly, when dealing with multiple targets, the model does not miss any targets. In all sub-images containing multiple targets, the model can completely identify all the targets, reflecting its good performance in the face of a complex multi-target environment.
Overall, whether in single-target or multi-target cases, the labels predicted by the model have a high consistency with the actual labels, and the predicted probability values are mostly at a relatively high level. This characteristic demonstrates the model’s outstanding advantages when dealing with single and multi-targets in the UATD dataset. In the complex underwater environment, this superiority is of great significance for target detection and recognition work, which can significantly improve the accuracy and reliability of detection and provide a strong guarantee for related work.

4. Limitations of the Proposed Method

4.1. Data Requirements and Adaptability Challenges

Although the MFF-YOLOv7 model incorporates certain strategies to handle data challenges, it still has notable limitations. The model requires a substantial amount of labeled data for effective training. Despite the efforts to address the small sample size issue, in scenarios where the underwater environment or target characteristics deviate significantly from the training data, the model’s performance may degrade. For example, when encountering new types of underwater terrains or previously unseen target species, the model might struggle to accurately detect and classify targets. This indicates that the model’s adaptability to novel underwater conditions and target variations needs improvement. Future research should focus on developing techniques to enhance the model’s generalization ability across diverse underwater scenarios and target types, potentially through advanced data augmentation methods or unsupervised learning strategies.

4.2. Noise Processing and Computational Efficiency

The MFF-YOLOv7 model has made efforts to address the noise problem in sonar images, but challenges remain. While the introduced modules such as RFAConv and the SCSA mechanism contribute to feature extraction and noise reduction, extremely high noise levels or complex noise patterns in underwater sonar images can still impact the model’s accuracy. Additionally, although the model aims to balance performance and computational complexity, its computational requirements are relatively high compared to some lightweight models. This restricts its application in resource-constrained underwater devices with limited processing power and memory. Future work should explore more efficient noise reduction algorithms and optimize the model’s architecture to reduce computational overhead, enabling its deployment in a wider range of underwater sensing systems.

4.3. Model Complexity and Generalization Ability

The complexity of the MFF-YOLOv7 model, with its multiple innovative modules like the Multi-Scale Information Fusion Module (MIFM) and the combination of different attention mechanisms, enhances its performance but also brings drawbacks. The intricate architecture increases the difficulty of training and debugging, slows down the model’s iteration speed, and reduces its maintainability. Moreover, in cases of severe data imbalance, the model’s generalization ability may be compromised. For instance, if certain target classes are severely underrepresented in the training data, the model might not perform optimally for those classes in real-world applications. Future research should strive to simplify the model’s structure without sacrificing performance, improve its generalization ability through more effective data handling techniques, and conduct comprehensive evaluations on a broader range of underwater sonar image datasets to ensure its stability and reliability in various practical scenarios.
By understanding these limitations, future research can be directed towards improving the model. This could involve developing more advanced noise suppression techniques, optimizing the model structure to reduce computational costs, exploring data augmentation and balancing strategies to enhance generalization ability, and improving the model’s interpretability. These efforts will enable the MFF-YOLOv7 model to play a more significant role in a wider variety of underwater target detection applications and contribute more effectively to the field of underwater imaging and sensing.

4.4. Dataset-Specific Performance and Improvement Directions

The MFF-YOLOv7 model has limitations despite its general superiority. For the URPC dataset, results are subpar, with low recall for scallops. This may stem from the dataset’s complex underwater environments and diverse target species, causing the model to struggle in generalization. To boost scallop recall, gathering more diverse, high-quality sonar images of scallops can enrich the dataset and help the model learn scallop features better. Optimizing the model’s architecture or training process, like modifying network layers or adjusting algorithms, is also essential to enhance sensitivity to scallop features.
The SCTD shows excellent results, likely due to a better fit between its features and the model’s design. However, a deeper analysis is needed to clarify the performance difference.
To improve overall underwater biota detection, other strategies include fine-tuning hyperparameters for target datasets to enhance accuracy and generalization. Advanced data preprocessing such as adaptive noise filtering and normalization can assist in handling input data. Incorporating domain knowledge by customizing feature extraction or attention mechanisms to match underwater organisms’ visual cues is beneficial. Understanding these limitations guides future research in developing better noise suppression, optimizing the model for lower computational costs, exploring data augmentation and balancing, and improving model interpretability, enhancing the model’s role in underwater target detection and its contribution to underwater imaging and sensing.

5. Conclusions

Our contributions can be summarized as follows:
  • We have introduced a series of new modules to enhance the model’s performance. The Multi-Scale Information Fusion Module (MIFM) replaces the SPPCSPC in YOLOv7. It integrates multi-scale information more effectively, strengthening the model’s ability to handle features of different scales (a simplified illustrative sketch of such a fusion block is given after this list). This addresses the difficulty that traditional modules have in processing features of different scales when underwater targets differ significantly in size, improves sonar image target recognition accuracy, and reduces missed and false detections.
  • The RFAConv has been introduced to replace the Conv in the CBS of ELAN. It boasts better feature extraction capabilities and is more adaptable to sonar images’ high-noise and unclear characteristics. As a result, it has significantly enhanced the model’s ability to learn and represent the features of sonar images, enabling it to extract helpful target features from noise more effectively.
  • Moreover, the SCSA mechanism has been introduced at three connection positions between the backbone network and the head. It helps the model focus on important feature information, reduces the interference of irrelevant information, and further improves the recognition accuracy and robustness of the model, allowing it to concentrate on target features more accurately in complex underwater environments.
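To make the first contribution above more concrete, the following PyTorch sketch shows a generic SPP-style multi-scale fusion block. It is a simplified stand-in written for illustration: the class name, kernel sizes, and channel widths are our own choices, and it is not the authors’ MIFM implementation.

```python
import torch
import torch.nn as nn

class SimpleMultiScaleFusion(nn.Module):
    """Toy multi-scale fusion block: parallel max-pooling branches of different
    kernel sizes are concatenated and fused by a 1x1 convolution. This is a
    simplified SPP-style stand-in for the MIFM, not the authors' implementation."""

    def __init__(self, in_channels: int, out_channels: int, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
        )
        self.fuse = nn.Conv2d(out_channels * (len(kernel_sizes) + 1), out_channels,
                              kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        branches = [x] + [pool(x) for pool in self.pools]   # multi-scale context
        return self.fuse(torch.cat(branches, dim=1))

# The block keeps the spatial resolution of the deepest backbone feature map.
y = SimpleMultiScaleFusion(1024, 512)(torch.randn(1, 1024, 20, 20))
print(y.shape)  # torch.Size([1, 512, 20, 20])
```

In YOLOv7-style detectors such a block typically sits at the end of the backbone, which is why the example preserves the spatial size of its input.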
We have carried out detailed experiments on the URPC, SCTD, and UATD datasets, covering attention mechanism comparison experiments, ablation experiments, and comparison experiments with other mainstream algorithms. These experiments fully validate the effectiveness and superiority of the MFF-YOLOv7 model: it improves on every metric, shows stronger generalization ability, and detects various underwater targets precisely, providing a more reliable and accurate solution for underwater target detection.

Author Contributions

Conceptualization, Z.C. and K.Z.; methodology, H.Z. and G.X.; software, H.L.; validation, Z.C., H.Z. and H.L.; formal analysis, J.L., L.L. and Z.L.; data curation, J.L., L.L. and Z.L.; investigation, K.Z. and G.X.; writing—original draft preparation, K.Z.; writing—review and editing, H.Z. and Z.C.; funding acquisition, Z.C. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Guangxi Science and Technology Base and Talent Project (No. GuikeAD21220098) and the 2021 Open Fund project of the Key Laboratory of Cognitive Radio and Information Processing of the Ministry of Education (No. CRKL210102). We also acknowledge the support of the Beihai City Science and Technology Bureau Project (No. Bei ke He 2023158004), the Innovation Project of Guangxi Graduate Education (No. YCSW2024344), and the Innovation Project of GUET Graduate Education (No. 2024YCXS022, 2024YCXS033, 2023YCXS038).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MFF-YOLOv7: Multi-Gradient Feature Fusion YOLOv7 model
MIFM: Multi-Scale Information Fusion Module
SPPCSPC: Spatial Pyramid Pooling Channel Shuffling and Pixel-level Convolution
CBS: Convolution-Batch Normalization-SiLU activation function
ELAN: Efficient Layer Aggregation Network
RFAConv: Receptive-Field Attention Convolution
SCSA: Spatial and Channel Synergistic Attention
UATD: Underwater Acoustic Target Detection Dataset
SCTD: Sonar Common Target Detection Dataset
URPC: The Underwater Optical Target Detection Intelligent Algorithm Competition 2021 Dataset
GCC-Net: Gated Cross-domain Collaborative Network
YOLO: You Only Look Once
Conv: Convolution
BN: Batch Normalization
SiLU: Sigmoid Linear Unit
RFA: Receptive Field Attention
SMSA: Shared Multi-Semantic Spatial Attention
PCSA: Progressive Channel Self-Attention
IoU: Intersection over Union
CBAM: Convolutional Block Attention Module
ECA: Efficient Channel Attention
SE: Squeeze-and-Excitation
SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks
BiFormer: Vision Transformer with Bi-Level Routing Attention
Faster R-CNN: Faster Region-based Convolutional Neural Network

References

  1. Ahmad-Kamil, E.; Zakaria, S.Z.S.; Othman, M.; Chen, F.L.; Deraman, M.Y. Enabling marine conservation through education: Insights from the Malaysian Nature Society. J. Clean. Prod. 2024, 435, 140554.
  2. Khoo, L.S.; Hasmi, A.H.; Mahmood, M.S.; Vanezis, P. Underwater DVI: Simple fingerprint technique for positive identification. Forensic Sci. Int. 2016, 266, e4–e9.
  3. Fan, X.; Lu, L.; Shi, P.; Zhang, X. A novel sonar target detection and classification algorithm. Multimed. Tools Appl. 2022, 81, 10091–10106.
  4. Yin, Z.; Zhang, S.; Sun, R.; Ding, Y.; Guo, Y. Sonar image target detection based on deep learning. In Proceedings of the 2023 International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), Ballar, India, 29–30 April 2023.
  5. Wang, X.; Yuen, K.F.; Wong, Y.D.; Li, K.X. How can the maritime industry meet Sustainable Development Goals? An analysis of sustainability reports from the social entrepreneurship perspective. Transp. Res. Part D Transp. Environ. 2020, 78, 102173.
  6. Vijaya Kumar, D.T.T.; Mahammad Shafi, R. A fast feature selection technique for real-time face detection using hybrid optimized region based convolutional neural network. Multimed. Tools Appl. 2022, 82, 1–14.
  7. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  8. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  9. Huang, C.; Zhao, J.; Zhang, H.; Yu, Y. Seg2Sonar: A Full-Class Sample Synthesis Method Applied to Underwater Sonar Image Target Detection, Recognition, and Segmentation Tasks. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5909319.
  10. Zhou, T.; Si, J.; Wang, L.; Xu, C.; Yu, X. Automatic detection of underwater small targets using forward-looking sonar images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4207912.
  11. Xi, J.; Ye, X.; Li, C. Sonar image target detection based on style transfer learning and random shape of noise under zero shot target. Remote Sens. 2022, 14, 6260.
  12. Villon, S.; Mouillot, D.; Chaumont, M.; Darling, E.S.; Subsol, G.; Claverie, T.; Villéger, S. A deep learning method for accurate and fast identification of coral reef fishes in underwater images. Ecol. Inform. 2018, 48, 238–244.
  13. Guo, X.; Zhao, X.; Liu, Y.; Li, D. Underwater sea cucumber identification via deep residual networks. Inf. Process. Agric. 2019, 6, 307–315.
  14. Dai, L.; Liu, H.; Song, P.; Liu, M. A gated cross-domain collaborative network for underwater object detection. Pattern Recognit. 2024, 149, 110222.
  15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  16. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
  17. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  18. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
  19. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
  20. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
  21. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
  22. Al Muksit, A.; Hasan, F.; Emon, M.F.H.B.; Haque, M.R.; Anwary, A.R.; Shatabda, S. YOLO-Fish: A robust fish detection model to detect fish in realistic underwater environment. Ecol. Inform. 2022, 72, 101847.
  23. Liu, Z.; Wang, B.; Li, Y.; He, J.; Li, Y. UnitModule: A light-weight joint image enhancement module for underwater object detection. Pattern Recognit. 2024, 151, 110435.
  24. Lei, F.; Tang, F.; Li, S. Underwater target detection algorithm based on improved YOLOv5. J. Mar. Sci. Eng. 2022, 10, 310.
  25. Zhou, Y.; Chen, S.; Wu, K.; Ning, M.; Chen, H.; Zhang, P. SCTD1.0: Common Sonar Target Detection Dataset. Ship Sci. Technol. 2021, 43, 54–58.
  26. Dong, J.; Yang, M.; Xie, Z.; Cai, L. Overview of Underwater Image Object Detection Dataset and Detection Algorithms. J. Ocean. Technol. 2022, 41, 60–72.
  27. Xie, K.; Yang, J.; Qiu, K. A dataset with multibeam forward-looking sonar for underwater object detection. Sci. Data 2022, 9, 739.
  28. Woo, S.; Park, J.; Lee, J.; Kweon, I. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521.
  29. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. arXiv 2020, arXiv:1910.03151v4.
  30. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2019, arXiv:1709.01507v4.
  31. Yang, L.; Zhang, R.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11863–11874.
  32. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. arXiv 2023, arXiv:2303.08810v1.
  33. Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the Synergistic Effects Between Spatial and Channel Attention. arXiv 2024, arXiv:2407.05128v1.
  34. Wu, W.; Luo, X. Sonar Object Detection Based on Global Context Feature Fusion and Extraction. In Proceedings of the 2024 12th International Conference on Intelligent Control and Information Processing (ICICIP), Nanjing, China, 8–10 March 2024; pp. 195–202.
  35. Mehmood, S.; Irfan Muhammad, H.U.H.; Ali, S. Underwater Object Detection from Sonar Images Using Transfer Learning. In Proceedings of the 2024 21st International Bhurban Conference on Applied Sciences Technology (IBCAST), Murree, Pakistan, 20–23 August 2024; pp. 1–2.
  36. Xue, G.; Zhang, J.; Wang, K.; Ma, D.; Weichen, P.; Hu, S.; Yang, Z.; Liu, T. Application of YOLOv7-tiny in the detection of steel surface defects. In Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering, Xi’an, China, 26–28 January 2024; pp. 718–723.
  37. Glenn, J. YOLOv8. Available online: https://github.com/ultralytics/ultralytics/tree/main (accessed on 16 November 2024).
  38. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
  39. Glenn, J. YOLOv5 Release v7.0. Available online: https://github.com/ultralytics/yolov5/tree/v7.0 (accessed on 16 November 2024).
  40. Wang, Z.; Guo, J.; Zeng, L.; Zhang, C.; Wang, B. MLFFNet: Multilevel feature fusion network for object detection in sonar images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5119119.
  41. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  42. Hou, J. Underwater Detection using Forward-Looking Sonar Images based on Deformable Convolution YOLOv3. In Proceedings of the 2024 4th International Conference on Neural Networks, Information and Communication (NNICE), Guangzhou, China, 19–21 January 2024; pp. 490–493.
  43. Pebrianto, W.; Mudjirahardjo, P.; Pramono, S.H.; Rahmadwati; Setyawan, R.A. YOLOv3 with Spatial Pyramid Pooling for Object Detection with Unmanned Aerial Vehicles. arXiv 2023, arXiv:2305.12344.
Figure 1. Structure of the YOLOv7.
Figure 2. The structure diagram of MIFM.
Figure 3. RFAConv structure diagram.
Figure 4. The structure diagram of the SCSA mechanism. The variable n represents the number of groups the sub-features are divided into, and 1P represents a single pixel.
Figure 5. The MFF-YOLOv7 network.
Figure 6. The sample information of the URPC dataset. (a) presents statistical information such as category distribution, box size, centroid position, and aspect ratio of the dataset for analyzing data characteristics. (b) shows example images from the dataset, presenting the actual situation of underwater scenes.
Figure 7. The confusion matrices of the models under the URPC dataset: (a) the confusion matrix of YOLOv7; (b) the confusion matrix of MFF-YOLOv7.
Figure 8. The PR curves of the models under the URPC dataset: (a) the PR curve of YOLOv7; (b) the PR curve of MFF-YOLOv7.
Figure 9. The recognition results of the model in the underwater scenes of the URPC dataset. (a) shows the recognition results of the YOLOv7 model, indicating the problems of missed detections, false detections, and interference from the environment in multi-target and single-target detections. (b) is the recognition result of the MFF-YOLOv7 model, reflecting the advantages of this model in accurately detecting various targets and balancing accuracy and recall rate.
Figure 10. The SCTD sonar images dataset.
Figure 11. MFF-YOLOv7 PR plots of the SCTD.
Figure 12. Confusion matrix results for the UATD dataset.
Figure 13. (a) Test set labels. (b) Predicted bounding boxes of the test set.
Table 1. Experimental contrasts of attention mechanisms based on YOLOv7.

Method | APechinus | APstarfish | APholothurian | APscallop | mAP@0.5 | mAP@0.5:0.95 | Precision (%) | Recall (%)
YOLOv7 [20] | 85.0% | 89.6% | 78.5% | 35.4% | 72.1% | 42.6% | 82.1 | 66.9
YOLOv7-CBAM [28] | 85.3% | 89.8% | 79.3% | 34.5% | 72.2% | 42.6% | 82.0 | 67.1
YOLOv7-ECA [29] | 85.0% | 89.5% | 79.5% | 36.2% | 72.6% | 42.8% | 82.1 | 68.1
YOLOv7-SE [30] | 85.5% | 89.4% | 78.6% | 35.7% | 72.3% | 42.7% | 83.0 | 66.4
YOLOv7-SimAM [31] | 85.2% | 89.9% | 79.2% | 36.2% | 72.7% | 42.8% | 82.0 | 67.1
YOLOv7-Biformer [32] | 85.3% | 90.1% | 79.8% | 36.3% | 72.9% | 42.9% | 81.4 | 66.5
YOLOv7-SCSA [33] | 86.0% | 90.8% | 84.7% | 37.7% | 74.8% | 43.5% | 84.9 | 68.9
Table 2. Ablation comparison of model performance improvements on the URPC dataset.

Model | RFAConv | SCSA | MIFM | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
YOLOv7 |  |  |  | 82.1% | 66.9% | 72.1% | 42.6%
YOLOv7 | √ |  |  | 85.1% | 67.2% | 73.1% | 42.7%
YOLOv7 | √ | √ |  | 88.2% | 69.3% | 75.9% | 44.3%
YOLOv7 | √ | √ | √ | 89.9% | 73.0% | 79.1% | 45.0%
The √ represents that this module is used in the model.
Table 3. Performance comparison of target detection models on the URPC dataset.

Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%)
YOLOv5s [34] | 80.4 | 65.5 | 68.9 | 38.1
YOLOv5m [35] | 82.6 | 65.2 | 69.5 | 41.2
YOLOv7 [20] | 82.1 | 66.9 | 72.1 | 42.6
YOLOv7-Tiny [36] | 82.7 | 63.6 | 69.6 | 36.9
YOLOv7-SDBB [20] | 82.0 | 66.8 | 72.4 | 43.4
YOLOv8n [37] | 80.1 | 64.8 | 68.6 | 38.6
YOLOv9 [21] | 82.1 | 64.8 | 71.0 | 42.0
MFF-YOLOv7 | 89.9 | 73.0 | 79.1 | 45.0
Table 4. Accuracy of various underwater target detection models on the SCTD.

Method | APship | APplane | APhuman | mAP@0.5 | mAP@0.5:0.95
SSD [38] | 86.2% | 86.8% | 86.1% | 86.4% | 43.0%
Faster R-CNN [17] | 88.2% | 86.8% | 87.3% | 87.5% | 43.8%
YOLOv3 [17] | 87.3% | 89.0% | 86.1% | 87.5% | 47.8%
YOLOv4 [18] | 89.2% | 87.9% | 87.3% | 88.2% | 49.6%
YOLOv5 [39] | 90.2% | 90.1% | 89.9% | 90.1% | 56.6%
YOLOv7 [20] | 89.2% | 89.0% | 89.9% | 89.3% | 54.0%
YOLOv8 [37] | 89.2% | 90.1% | 89.9% | 89.7% | 54.7%
MFF-YOLOv7 | 96.3% | 99.9% | 99.9% | 98.7% | 63.2%
Table 5. Comparison of performance of different object detection models and MFF-YOLOv7 on the UATD dataset.

Model | Precision (%) | Recall (%) | mAP@0.5 (%)
RetinaNet [40] | 63.2 | 62.4 | 62.5
Faster R-CNN [17] | 74.3 | 75.3 | 75.1
YOLOv3 [17] | 85.4 | 82.1 | 79.1
SDD Net [41] | 81.3 | 79.7 | 80.2
YOLO-DCN [42] | 86.2 | 83.4 | 80.5
YOLOv3SPP [43] | 91.1 | 93.0 | 92.2
YOLOv8 [37] | 85.4 | 81.0 | 83.3
MFF-YOLOv7 | 91.2 | 88.9 | 87.2