1. Introduction
The ocean contains abundant resources, such as oil, natural gas, and seafood, which are being exploited to meet the demands of a rapidly developing human society. However, the harsh marine environment is hostile to human beings, so marine vehicles equipped with underwater target detection techniques are increasingly utilised for various marine activities, such as mine exploration, underwater aquaculture, and marine rescue [1,2].
Most existing object detection methods are applied to terrestrial environments, where large quantities of high-quality training samples make it relatively easy to obtain a high-precision network model for target detection. Model training for underwater target detection is much more difficult because of the complex underwater environment (e.g., turbid water, uneven lighting, and ocean current interference [3]) and insufficient datasets.
Existing object detection methods can generally be divided into two categories: target detection based on hand-crafted features and target detection based on deep learning [4]. The former relies on manually designed features of the targets to be detected. These features are often intuitive, such as colour, texture, and shape, and people can identify them directly. Common detection methods based on hand-crafted features include Local Binary Pattern (LBP) [5], Scale-Invariant Feature Transform (SIFT) [6], and Histogram of Oriented Gradients (HOG) [7]. Villon et al. combined HOG and SVM to detect coral reef fish in the collected images and to mitigate the accuracy degradation caused by target occlusion and overlap in underwater images [8]. Although target detection algorithms based on hand-crafted features achieve high accuracy for specific targets, they depend largely on the designer’s knowledge of the related fields, and the designer’s subjective views also affect the feature design to a certain extent. In addition, the feature design process consumes considerable manpower and material resources.
In recent years, deep learning has gained wide recognition, and its application to target detection has become a research hotspot. Compared with target detection based on hand-crafted features, target detection based on deep learning has the advantages of wider application fields, convenient design, and simple dataset production, which can save a lot of manpower and material resources. Target detection approaches based on deep learning can be divided into region proposal-based and regression-based algorithms. In region proposal-based target detection algorithms, candidate object bounding boxes are generated first, for example by a Region Proposal Network (RPN) included in the model, and the candidates are then classified and refined. Some typical region proposal-based target detection models include Fast R-CNN [9], Faster R-CNN, Mask R-CNN [10], and R-FCN [11]. Zeng et al. combined the Faster R-CNN network and the adversarial occlusion network to construct a new network named Faster R-CNN-AON [12]. The network works well in suppressing the overfitting that can occur during model training. Song et al. improved the Mask R-CNN network, and the improved model performed well in target detection in complex underwater environments [13]. Although target detection algorithms based on region proposal have relatively high detection accuracy, they are quite time-consuming for underwater target detection and are not suitable for real-time detection tasks.
Unlike the target detection approaches based on region proposal, those based on regression do not have a Region Proposal Network (RPN) but directly generate the predicted boxes on the input image to detect the target. Therefore, regression-based target detection approaches have faster detection speed and lower hardware requirements and are suitable for portable devices. Some typical regression-based target detection models are YOLO [14], YOLO9000 [15], YOLOv3 [16], YOLOv4 [17], and SSD [18]. Chen et al. proposed an improved YOLOv4 model named YOLOv4-UW [19]. The authors removed the large-sized output layer and the SPP structure from the original model and replaced the original upsampling module with a deconvolution module. The improved model increases the detection speed for specific objects. Yao et al. improved the SSD model [20]. The authors designed two residual units to replace the depthwise separable convolution in the original model as the feature extractor, which greatly reduced the number of model parameters and thus sped up model training. In summary, regression-based detection algorithms are less computationally demanding and are suitable for real-time detection tasks, but their detection accuracy is relatively low. Nevertheless, because of the limited hardware of underwater robots, most existing underwater target detection tasks are still dominated by regression-based target detection algorithms.
Improvements to the original regression-based target detection algorithms have raised detection accuracy without sacrificing detection speed. Among the various improvements, embedding an attention module into the model is one of the most effective. The principle of the attention mechanism is to focus most of the computing power on the important objects and to ignore the unimportant parts of the image, such that the detection model can detect the objects in the picture more efficiently. Li et al. aimed to detect small and blurry targets in infrared images [21]. They embedded the SK attention module in the original YOLOv5 model and obtained good accuracy in the detection of blurred objects in complex environments. The YOLOv5 algorithm is a regression-based target detection model available in several sizes. Owing to its simple structure and small number of parameters, it can achieve good real-time performance when applied to underwater target detection and can be easily deployed in various portable underwater robots. However, to the best of our knowledge, there are still no reported instances of applying the YOLOv5 model to underwater target detection. In this work, we embedded the attention mechanism into the YOLOv5 model to improve the detection accuracy in complex underwater environments without greatly reducing the detection speed.
The main contributions of this work are summarised as follows. Firstly, the number of bottlenecks in the first C3 module of the model was increased from one to three to improve shallow feature extraction, so that the model can capture more subtle features. Secondly, some C3 modules were modified by embedding the CA attention module in them, so that the model can concentrate its computing power on the region of interest. Finally, the SE layer was added at specific positions of the model to further strengthen the model's focus on the region of interest.
The rest of this paper is organised as follows: In Section 2, the original YOLOv5s model is briefly introduced, and the corresponding innovations proposed in this paper are described. Section 3 describes the experimental platform, parameter settings, etc., and the experimental results are recorded and analysed. Finally, Section 4 concludes this paper and outlines prospective future work.
2. Modified YOLOv5s Model
In this section, an overview of the YOLOv5s network is first presented. Next, the CA and SE modules used to modify the basic YOLOv5s structure are introduced. The modified model, named YOLOv5s-CA, is intended to improve the accuracy of underwater target detection.
Figure 1 shows the diagram of the process for applying the modified YOLOv5s model to underwater target detection.
2.1. Basic YOLOv5s Model
The YOLOv5 model is a regression-based target detection model released by Glenn Jocher in 2020. It was developed on the basis of earlier target detection models such as YOLOv3 and YOLOv4. Compared with these earlier models, the YOLOv5 model improves the detection accuracy while maintaining the detection speed. The YOLOv5 model has four structures: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. These four structures differ mainly in the number of convolution kernels and bottlenecks in specific parts. From YOLOv5s to YOLOv5x, the number of parameters increases and the detection accuracy increases accordingly, but the detection speed decreases. In the actual underwater environment, the hardware of underwater robots is limited, so the model size must be strictly restricted. This motivated the choice of the YOLOv5s model as the experimental object in this work.
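The way the four structures differ in the number of convolution kernels and bottlenecks can be illustrated with a short sketch. It is not taken from this paper: the depth and width multipliers, the nominal backbone layout, and the rounding rules below are assumptions based on the publicly released YOLOv5 configuration files of the Focus/SPP generation, and the helper names are hypothetical.

```python
import math

# (depth_multiple, width_multiple) per variant.
# Assumption: values as published in the public YOLOv5 configuration files.
VARIANTS = {"s": (0.33, 0.50), "m": (0.67, 0.75), "l": (1.00, 1.00), "x": (1.33, 1.25)}

# Nominal (output channels, bottleneck repeats) of the four backbone C3 modules.
# Assumption: layout of the Focus/SPP-based backbone described in this section.
BASE_C3 = [(128, 3), (256, 9), (512, 9), (1024, 3)]


def scaled_depth(repeats, depth_multiple):
    """Scale the number of bottlenecks; a module always keeps at least one."""
    return max(round(repeats * depth_multiple), 1) if repeats > 1 else repeats


def scaled_width(channels, width_multiple, divisor=8):
    """Scale the number of convolution kernels, rounded up to a multiple of 8."""
    return math.ceil(channels * width_multiple / divisor) * divisor


def backbone_c3_layout(variant):
    """Return (channels, bottlenecks) for each backbone C3 module of a variant."""
    depth, width = VARIANTS[variant]
    return [(scaled_width(c, width), scaled_depth(n, depth)) for c, n in BASE_C3]


for name in VARIANTS:
    print(name, backbone_c3_layout(name))
```

Under these assumptions, the first backbone C3 module of YOLOv5s has a single bottleneck while the next one has three, which is consistent with the description of the baseline in Section 2.3.1.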
The YOLOv5s model is mainly divided into three parts: the backbone network, the neck network, and the detect network. The backbone network is a convolutional neural network that generates feature maps of different sizes. It comprises the Focus module, the Conv module, the C3 module, and the Spatial Pyramid Pooling (SPP) module. The neck network adopts the structure of Feature Pyramid Network (FPN) + Path Aggregation Network (PAN). It fuses the low-level spatial features with the high-level semantic features generated by the backbone network using bidirectional fusion and then passes the fused feature maps to the detect network. The detect network applies anchor boxes to the input feature maps to generate detection boxes that indicate the type, location, and confidence of each target in the image.
Figure 2 shows the overall network structure of the YOLOv5s model.
2.2. Attention Mechanism
When observing a scene, the human visual system tends to focus on the most important objects while ignoring the parts of the field of view that are not important for identifying them [22]. The attention mechanism [23] works similarly, as it can tell the object detection model which objects are important and where they are located in the acquired image [24]. In existing studies, many researchers have embedded attention mechanisms into deep neural networks and achieved good experimental results [25]. Common applications of attention mechanisms embedded in deep neural networks include object classification [26], image segmentation [27], and target detection [28].
Some typical attention modules include the SK module [29], the CBAM module [30], and the TA module [31]. The CBAM module, proposed in 2018, is an attention module that can be incorporated into feed-forward convolutional neural networks. It consists of two independent sub-modules, the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). The CAM focuses on the distribution relationship of feature map channels, while the SAM allows the neural network to pay attention to the regions of the image that are the most relevant for classification. By considering both spatial and channel attention, the CBAM module can enhance the performance of the model.
The SK module, proposed in 2019, is another channel attention module commonly used in vision models. It utilises a selective kernel mechanism that assigns different levels of importance to convolution kernels depending on the input image. By using convolution kernels of different scales, the SK module extracts features from the input feature maps and generates channel attention information that enhances the channel features without changing their dimensionality.
The Squeeze-and-Excitation (SE) network module, proposed by Momenta in 2017, is another attention module, and the network built on it won the image classification task of the ImageNet 2017 competition [32]. The SE module models the interdependence among feature channels and learns the importance of each channel to weight the features, highlighting important features and suppressing unimportant ones. It improves the expressive ability of the model and its ability to detect targets in blurred images while remaining lightweight. It can be easily integrated into various network frameworks and achieves significant performance improvements at a slight computational cost. Embedding the SE module at an appropriate position in a target detection model can therefore improve the detection accuracy of the model.
Figure 3 shows the structure of the SE attention module.
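The channel reweighting described above can be sketched in a few lines. The following is a minimal, framework-free illustration rather than the implementation used in this work: the function name, the nested-list tensor layout, and the omission of bias terms in the two fully connected layers are simplifying assumptions.

```python
import math


def se_block(x, w_reduce, w_expand):
    """Sketch of Squeeze-and-Excitation on one feature map.

    x:        C x H x W nested lists (feature map)
    w_reduce: (C // r) x C weights of the first fully connected layer
    w_expand: C x (C // r) weights of the second fully connected layer
    Bias terms are omitted for brevity (assumption of this sketch).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid,
    # producing one importance weight in (0, 1) per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, squeezed))) for row in w_reduce]
    scale = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: reweight every channel; the tensor shape is unchanged.
    out = [[[s * v for v in row] for row in ch] for ch, s in zip(x, scale)]
    return out, scale


# Toy input: two 2 x 2 channels and a reduction ratio of 2 (one hidden unit).
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
reweighted, weights = se_block(feature_map, [[1.0, 1.0]], [[1.0], [-1.0]])
print(weights)
```

With the toy weights above, the first channel is emphasised and the second is suppressed, while the output keeps the shape of the input, which is why such a block can be placed after an existing module without changing the layers that follow it.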
The Coordinate Attention (CA) module [33] is an attention module proposed by Hou et al. in 2021. Compared with previous modules, such as the CBAM module, the CA module has achieved further improvements in deep neural networks. The CA module embeds the location information of the target in the image into the channel attention so that the detection model can better locate and detect the target. Its operation can be divided into two steps: coordinate information embedding and coordinate attention generation. Instead of compressing the feature tensor into a single feature vector with 2D global pooling, the CA module decomposes the pooling into two 1D feature encoding processes along two different directions (horizontal and vertical). These two 1D feature encoding processes allow the model to capture long-range dependencies along one spatial direction while preserving the location information of objects along the other. The target location information is thus retained in the generated attention maps, which are then applied to the input feature map by element-wise multiplication before the next convolution operation.
Figure 4 shows the structure of the CA attention module.
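The two steps of the CA module, coordinate information embedding and coordinate attention generation, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: the function names and nested-list layout are hypothetical, batch normalisation is omitted, and the h-swish non-linearity is assumed for the shared transform.

```python
import math


def conv1x1(weight, feat):
    """1x1 convolution: weight is C_out x C_in, feat is C_in x L; returns C_out x L."""
    length = len(feat[0])
    return [[sum(w * feat[c][p] for c, w in enumerate(row)) for p in range(length)]
            for row in weight]


def h_swish(v):
    return v * min(max(v + 3.0, 0.0), 6.0) / 6.0


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention(x, w_shared, w_h, w_w):
    """x: C x H x W; w_shared: M x C; w_h and w_w: C x M (M = reduced channels)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Coordinate information embedding: two 1D average poolings.
    pooled_h = [[sum(row) / W for row in ch] for ch in x]              # C x H
    pooled_w = [[sum(ch[i][j] for i in range(H)) / H for j in range(W)]
                for ch in x]                                           # C x W
    # Coordinate attention generation: concatenate, shared 1x1 transform, split.
    joint = [ph + pw for ph, pw in zip(pooled_h, pooled_w)]            # C x (H + W)
    joint = [[h_swish(v) for v in row] for row in conv1x1(w_shared, joint)]
    part_h = [row[:H] for row in joint]
    part_w = [row[H:] for row in joint]
    att_h = [[sigmoid(v) for v in row] for row in conv1x1(w_h, part_h)]  # C x H
    att_w = [[sigmoid(v) for v in row] for row in conv1x1(w_w, part_w)]  # C x W
    # Apply both direction-aware attention maps by element-wise multiplication.
    return [[[x[c][i][j] * att_h[c][i] * att_w[c][j] for j in range(W)]
             for i in range(H)] for c in range(C)]


# Toy input: 2 channels, height 2, width 3, one reduced channel.
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [[6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]]
y = coordinate_attention(x, [[0.5, -0.5]], [[1.0], [-1.0]], [[-1.0], [1.0]])
print(y)
```

Each output position (i, j) is scaled by a weight that depends on its row and a weight that depends on its column, which is how position information enters the channel attention in this sketch.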
2.3. YOLOv5s-CA Model
In the complex underwater environment, since the collected images are often blurred, if the target detection model is directly applied to target detection tasks, the detection accuracy is greatly reduced. By expanding the number of Bottleneck modules in the first C3 module, the capability of low-level feature extraction can be improved, thereby improving the detection performance of blurred objects. Moreover, the attention module is embedded into the model, which can further enhance the performance of the model in the detection of blurred or small targets.
2.3.1. Extended Bottleneck Module
There are many convolutional layers in the backbone network of YOLOv5s for the feature extraction of targets. As the network performs multiple convolutions on an image, the high-level features of the image are extracted for subsequent processing. These high-level features contain plenty of semantic information but have very low resolution, resulting in low accuracy in the detection of blurred objects. In contrast, the shallow modules of the network extract low-level features at a higher resolution. For small and weak targets with few features in underwater images, deep convolution may cause difficulty in feature extraction or even feature loss. To retain as many features of blurred objects in underwater images as possible, it is necessary to make full use of the high-resolution features extracted by the shallow network. Therefore, it is reasonable to strengthen the shallow modules of the YOLOv5s model. We expanded the number of bottlenecks in the first C3 module from one to three, such that the first C3 module has the same number of bottlenecks as the following C3 modules. This expansion improves the extraction of low-level features.
Figure 5 shows the schematic diagram of the first C3 module here improved.
As can be seen in Figure 5, in the first C3 module, there are two branches that perform feature extraction. Then, the obtained features merge to obtain a richer feature combination. By increasing the number of bottlenecks in the first C3 module, the extraction of low-level features can be improved without significantly increasing the complexity of the model; as a result, the model can better detect blurred targets.
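The claim that the extra bottlenecks add little complexity can be checked with a rough parameter count. The sketch below is not taken from this paper: it assumes the C3 layout of the public YOLOv5 implementation (two 1x1 branch convolutions, one 1x1 fusion convolution, and bottlenecks made of a 1x1 and a 3x3 convolution, each convolution without bias and followed by batch normalisation) and 64 input and output channels for the first C3 module of YOLOv5s.

```python
def conv_params(c_in, c_out, k=1):
    """Convolution weights (no bias) plus batch-norm scale and shift."""
    return c_in * c_out * k * k + 2 * c_out


def bottleneck_params(c):
    """One bottleneck: a 1x1 convolution followed by a 3x3 convolution."""
    return conv_params(c, c, 1) + conv_params(c, c, 3)


def c3_params(c_in, c_out, n):
    """C3 module with n bottlenecks in its trunk branch (assumed layout)."""
    c_hidden = c_out // 2
    branches = 2 * conv_params(c_in, c_hidden, 1)    # trunk and shortcut branches
    fusion = conv_params(2 * c_hidden, c_out, 1)     # convolution after concatenation
    return branches + fusion + n * bottleneck_params(c_hidden)


original = c3_params(64, 64, n=1)
extended = c3_params(64, 64, n=3)
print(original, extended, extended - original)
```

Under these assumptions, the module grows from 18,816 to 39,552 parameters, an increase of 20,736, which is small for a model whose public baseline has several million parameters.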
2.3.2. Improved C3 Module
By integrating the attention mechanism into the target detection model, the model's ability to focus on objects in the image can be enhanced, enabling it to allocate most of the computational resources to the target location. In this paper, the CA attention module is adopted for this purpose. This module preserves the positional information of the target object in the image and incorporates it into the convolution operations of the model.
This paper combined the CA attention module and the C3 module of the YOLOv5s model to replace the original C3 module, aiming to improve the performance of the detection model in the detection of blurred underwater targets. Two combinations were used. The first is to add a CA module between the Conv module and the Bottleneck module in the trunk branch of the C3-True module to improve shallow feature extraction. The second is to use the CA module to replace the Bottleneck module in the trunk branch of the C3-False module. The second combination is mainly adopted to reduce the number of model parameters while improving the detection accuracy without greatly decreasing the detection speed.
The original CA attention module uses a reduction ratio to reduce the number of channels. In this paper, we modified the value of the reduction ratio from the original 32 to 8 to reduce the number of parameters to a certain extent. The schematic diagram of the first combination mode, named the C3-1 module, is shown in Figure 6. The schematic diagram of the second combination mode, called the C3-2 module, is shown in Figure 7.
2.3.3. Improved Backbone Network
Due to the complexity of the underwater environment, the quality of the collected underwater pictures is often low, and the targets in the pictures are correspondingly unclear. In addition, a variety of targets need to be detected in one picture. Therefore, the detection model should extract the features of each target from different complex feature maps. Thus, we embedded the SE (Squeeze-and-Excitation) attention module at suitable positions in the backbone network of the YOLOv5s model to adaptively adjust the channel weights according to the convolutional input without greatly increasing the model complexity.
In this paper, the SE module was added to the output of some C3 modules in the backbone network to enhance the feature extraction capability of the YOLOv5s model. The experiments conducted in this study revealed that placing an SE attention module after the C3 module enhances the model's ability to concentrate on the target and to extract richer, more comprehensive features from the feature map produced by the C3 module. This combination allows the model to recalibrate the channels of the C3 output before the features are passed to the subsequent convolutional layers. With this improvement, the detection model can suppress the unimportant objects in the image and enhance the performance in the detection of blurred objects.
In summary, to efficiently extract the shallow features of images, this paper expanded the number of bottlenecks in the first C3 module. Moreover, some C3 modules were improved by embedding the CA attention module into them, such that the model pays more attention to the important objects in the image at a reduced computational cost. Finally, the SE attention module was embedded at appropriate positions in the backbone network, such that the model further focuses on the important objects while ignoring the unimportant positions in the image.
Figure 8 shows the structural diagram of the modified backbone network. We named the modified YOLOv5s network the YOLOv5s-CA model. Corresponding verification experiments were carried out, as reported in the next section, to verify the effectiveness of the method proposed in this paper.