Research on Rapid Detection of Underwater Targets Based on Global Differential Model Compression
Abstract
1. Introduction
- A lightweight model, YOLO-TN, is designed for underwater target recognition. Experimental results show that, after extreme compression, YOLO-TN achieves an mAP@0.5 of 0.5425 at an input size of 416 × 416, a drop of less than 1% relative to YOLO-V5s, while reaching 28.8 FPS for inference on the CPU. The model therefore combines high accuracy with a lightweight design, making offline deployment and real-time inference feasible.
- A realistic underwater dataset is constructed and processed to mitigate the imbalanced target counts and homogeneous imaging conditions of existing underwater datasets. We also analyze the challenges inherent in underwater imagery, including poor illumination, image degradation, and blurring. For different real underwater environments, this paper applies preprocessing techniques including dark channel dehazing, underwater image color restoration, and an automatic color balance algorithm (a minimal sketch of the dark channel step follows this list). These methods improve the quality of the underwater datasets and, in turn, the model's generalization.
- The YOLO-TN model is deployed on the Jetson TX2 embedded platform with the MNN inference engine, establishing a real-time offline underwater target recognition system (an illustrative deployment sketch appears after the pruning results table). Test results indicate that YOLO-TN sustains real-time frame rates: the pruned YOLO-TN reaches 28.6 and 20.4 FPS at input sizes of 320 × 320 and 416 × 416, respectively.
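To make the preprocessing concrete, the sketch below implements dark channel prior dehazing in the spirit of He et al.; it is a minimal illustration rather than the authors' exact pipeline (which also includes color restoration and automatic color balance), and the patch size, `omega`, and `t0` defaults are assumed values.

```python
import cv2
import numpy as np

def dark_channel(img, patch=15):
    """Per-pixel minimum over RGB, then a min-filter over a local patch."""
    min_rgb = img.min(axis=2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(min_rgb, kernel)

def estimate_airlight(img, dark, top=0.001):
    """Average the brightest 0.1% of dark-channel pixels as the airlight A."""
    n = max(1, int(dark.size * top))
    idx = np.argsort(dark.ravel())[-n:]
    return img.reshape(-1, 3)[idx].mean(axis=0)

def dehaze(img_bgr, omega=0.95, t0=0.1, patch=15):
    """Dark-channel-prior dehazing; img_bgr is a uint8 BGR image."""
    img = img_bgr.astype(np.float64) / 255.0
    dark = dark_channel(img, patch)
    A = estimate_airlight(img, dark)
    # Transmission map: t(x) = 1 - omega * dark_channel(I / A).
    # omega < 1 keeps a trace of haze so results look natural.
    t = 1.0 - omega * dark_channel(img / A, patch)
    t = np.clip(t, t0, 1.0)[..., None]
    # Recovered scene radiance: J = (I - A) / t + A.
    J = (img - A) / t + A
    return (np.clip(J, 0, 1) * 255).astype(np.uint8)

# Usage: result = dehaze(cv2.imread("underwater.jpg"))
```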
2. Model Construction
2.1. Loss Function
2.2. Construction of the Student Model
2.3. Parameter Pruning
2.4. Evaluation Criteria
3. Model Validation
Hardware environment:
- CPU: Intel(R) Xeon(R) Silver 4110;
- Memory: 36 GB DDR4;
- GPU: NVIDIA RTX 2080 Ti.

Software environment:
- Ubuntu 18.04;
- PyTorch 1.9.0;
- CUDA 11.2.
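A quick sanity check of this software stack can be run before training; the snippet below is illustrative and not part of the paper's code:

```python
import torch

# Confirm the stack matches the environment listed above.
print("PyTorch:", torch.__version__)                    # expect 1.9.0
print("CUDA (compiled against):", torch.version.cuda)   # expect 11.x
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))        # expect an RTX 2080 Ti
```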
4. Engineering Experiment
4.1. Evaluation Criteria
4.2. Data Preprocessing
4.3. Model Application
5. Conclusions and Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Xu, S.; Zhang, M.; Song, W.; Mei, H.; He, Q.; Liotta, A. A systematic review and analysis of deep learning-based underwater object detection. Neurocomputing 2023, 527, 204–232.
- Zhang, R.; Li, S.; Ji, G.; Zhao, X.; Li, J.; Pan, M. Survey on Deep Learning-Based Marine Object Detection. J. Adv. Transp. 2021, 2021, 5808206.
- Dakhil, R.A.; Khayeat, A.R.H. Review on deep learning techniques for marine object recognition: Architectures and algorithms. In Proceedings of the CS & IT-CSCP 2022, Vancouver, BC, Canada, 26–27 February 2022; pp. 49–63.
- Myers, V.; Fawcett, J. A template matching procedure for automatic target recognition in synthetic aperture sonar imagery. IEEE Signal Process. Lett. 2010, 17, 683–686.
- Barngrover, C.M. Automated Detection of Mine-like Objects in Side Scan Sonar Imagery; University of California: San Diego, CA, USA, 2014.
- Abu, A.; Diamant, R. A statistically-based method for the detection of underwater objects in sonar imagery. IEEE Sensors J. 2019, 19, 6858–6871.
- Kim, B.; Yu, S. Imaging sonar based real-time underwater object detection utilizing AdaBoost method. In Proceedings of the 2017 IEEE Underwater Technology (UT), Busan, Republic of Korea, 21–24 February 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–5.
- Chiang, J.; Chen, Y. Underwater image enhancement by wavelength compensation and dehazing. IEEE Trans. Image Process. 2012, 21, 1756–1769.
- Akkaynak, D.; Treibitz, T. A revised underwater image formation model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Lu, H.; Li, Y.; Zhang, Y.; Chen, M.; Kim, S.S. Underwater Optical Image Processing: A Comprehensive Review. Mobile Netw. Appl. 2017, 22, 1204–1211.
- Anwar, S.; Li, C. Diving deeper into underwater image enhancement: A survey. Signal Process. Image Commun. 2020, 89, 115978.
- Liu, C.; Wang, Z.; Wang, S.; Tang, T.; Tao, Y.; Yang, C.; Li, H.; Liu, X.; Fan, X. A New Dataset, Poisson GAN and AquaNet for Underwater Object Grabbing. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5.
- Yang, J.; Han, P.; Li, X. Equilibrating the impact of fluid scattering attenuation on underwater optical imaging via adaptive parameter learning. Opt. Express 2024, 32, 23333–23346.
- Beijbom, O.; Edmunds, P.J.; Kline, D.I.; Mitchell, B.G.; Kriegman, D. Automated annotation of coral reef survey images. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1170–1177.
- Palazzo, S.; Kavasidis, I.; Spampinato, C. Covariance based modeling of underwater scenes for fish detection. In Proceedings of the 20th IEEE International Conference on Image Processing, Melbourne, Australia, 15–18 September 2013; pp. 1481–1485.
- Ravanbakhsh, M.; Shortis, M.R.; Shafait, F.; Mian, A.; Harvey, E.S.; Seager, J.W. Automated fish detection in underwater images using shape-based level sets. Photogramm. Rec. 2015, 30, 46–62.
- Hou, G.-J.; Luan, X.; Song, D.-L.; Ma, X.-Y. Underwater man-made object recognition on the basis of color and shape features. J. Coast. Res. 2015, 32, 1135–1141.
- Vasamsetti, S.; Setia, S.; Mittal, N.; Sardana, H.K.; Babbar, G. Automatic underwater moving object detection using multi-feature integration framework in complex backgrounds. IET Comput. Vis. 2018, 12, 770–778.
- Wang, Q.; Zeng, X. Deep learning methods and their applications in underwater targets recognition. In Proceedings of the 2015 Academic Conference of the Hydroacoustics Branch of the Acoustical Society of China, Hydroacoustics Branch of the Acoustical Society of China, Harrogate, UK, 15 October 2015; p. 3. Available online: https://kns.cnki.net/kcms2/article/abstract?v=zcLOVLBHd2yuc0K9K0lIzqLOnyKffA5JXrD7S_1b3A_AZXUYyZdd4zqOJi6uoXZuBegPu97bvG__mRmWiZ1qiES5LkrfFdAaLnkYK8_GA9f1_xAZ0NOvmf3X2L4wqsnvfrs4_PiwGj1e4kfoQ9LpLw==&uniplatform=NZKPT&language=CHS (accessed on 20 December 2023).
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 580–587.
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1440–1448.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
- Beery, S.; Wu, G.; Rathod, V. Context R-CNN: Long term temporal context for per-camera object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13075–13085.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788.
- Terven, J.; Córdova-Esparza, D.M. A Comprehensive Review of YOLO: From YOLOv1 to YOLOv8 and Beyond. arXiv 2023.
- Liu, W.; Anguelov, D.; Erhan, D. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Huynh-Thu, Q.; Ghanbari, M. Perceived Quality of the Variation of the Video Temporal Resolution for Low Bit Rate Coding. Available online: https://www.researchgate.net/publication/266575823/_Perceived_quality_of_the_variation_of_the_video_temporal_resolution_for_low_bit_rate_coding (accessed on 20 December 2023).
- Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, Montreal, QC, Canada, 7–12 December 2015; MIT Press: Cambridge, MA, USA, 2015; Volume 1, pp. 1135–1143.
- Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning structured sparsity in deep neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2082–2090.
- Lin, M.; Ji, R.; Wang, Y.; Zhang, Y. HRank: Filter Pruning Using High-Rank Feature Map. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1526–1535.
- Gao, S.; Huang, F.; Cai, W.; Huang, H. Network Pruning via Performance Maximization. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 9266–9276.
- Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv 2021.
- Faraone, J.; Fraser, N.; Blott, M.; Leong, H.W. SYQ: Learning Symmetric Quantization for Efficient Deep Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4300–4309.
- Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv 2016.
- Chen, P.; Liu, J.; Zhuang, B.; Tan, M.; Shen, C. AQD: Towards Accurate Quantized Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 21–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 104–113.
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 6848–6856.
- Wang, X.; Kan, M.; Shan, S.; Chen, X. Fully Learnable Group Convolution for Acceleration of Deep Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9041–9050.
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
- Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; ACM: New York, NY, USA, 2006; pp. 535–541.
- Zagoruyko, S.; Komodakis, N. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. arXiv 2017.
- Heo, B.; Lee, M.; Yun, S.; Choi, J.Y. Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons. AAAI 2019, 33, 3779–3787.
- Peng, B.; Jin, X.; Liu, J.; Zhou, S.; Wu, Y.; Liu, Y.; Li, D.; Zhang, Z. Correlation Congruence for Knowledge Distillation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 5006–5015.
- Cho, J.H.; Hariharan, B. On the Efficacy of Knowledge Distillation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4793–4801.
- Mirzadeh, S.I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; Ghasemzadeh, H. Improved Knowledge Distillation via Teacher Assistant. AAAI 2020, 34, 5191–5198.
- Liu, Y.; Jia, X.; Tan, M.; Vemulapalli, R.; Zhu, Y.; Green, B.; Wang, X. Search to Distill: Pearls Are Everywhere but Not the Eyes. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 7536–7545.
- Shen, S.H.; Li, Y.L.; Qiang, Y.K.; Xue, R.L.; Jun, W.L. Research on Compression of Teacher Guidance Network Use Global Differential Computing Neural Architecture Search. In Proceedings of the 2022 5th International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 27–30 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 526–531.
- Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable Architecture Search. arXiv 2019.
- He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353.
- Berman, D.; Treibitz, T.; Avidan, S. Diving into haze-lines: Color restoration of underwater images. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017; Volume 1.
- Getreuer, P. Automatic color enhancement (ACE) and its fast implementation. Image Process. Line 2012, 2, 266–277.
| Model | Cell Number | Initial Channels | mAP@0.5 (Undistilled/Distilled) | Parameters (M) | FPS (GPU/CPU) | FLOPs (G) |
|---|---|---|---|---|---|---|
| YOLO-TN(a) | 10 | 8 | 0.5038/0.5205 | 2.8704 | 112.7/9.8 | 7.8 |
| YOLO-TN(b) | 10 | 16 | 0.5312/0.5441 | 3.0516 | 109.4/7.7 | 9.1 |
| YOLO-TN(c) | 7 | 16 | 0.5326/0.5437 | 1.2083 | 134.5/8.9 | 3.9 |
| YOLO-TN(d) | 5 | 16 | 0.5355/0.5471 | 0.9481 | 162.7/10.2 | 3.5 |
| YOLO-TN(e) | 4 | 16 | 0.5384/0.5592 | 0.8401 | 176.3/14.1 | 3.3 |
| YOLO-V5s | – | – | 0.5495/– | 7.2 | 178.9/8.3 | 16.5 |
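The "Distilled" figures above come from teacher-student training. The paper's exact teacher-guided objective is not reproduced in this excerpt, so the sketch below shows the standard soft-target distillation loss of Hinton et al. as a generic reference point; the temperature `T` and mixing weight `alpha` are assumed hyperparameters, and distilling a detection head in practice involves additional terms not shown here.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Generic soft-target distillation loss (Hinton et al.), for illustration."""
    # Soften both distributions with temperature T; the T^2 factor keeps
    # gradient magnitudes comparable to the hard-label term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)  # ordinary supervised term
    return alpha * soft + (1.0 - alpha) * hard
```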
| Model | Input Size | mAP@0.5 | FPS (GPU/CPU) | FLOPs (G) |
|---|---|---|---|---|
| YOLO-TN-640 | 640 × 640 | 0.5592 | 176.8/17.2 | 3.3 |
| YOLO-TN-416 | 416 × 416 | 0.5425 | 176.9/28.8 | 2.8 |
| YOLO-TN-320 | 320 × 320 | 0.5101 | 177.6/38.4 | 2.8 |
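The FPS columns above vary with input size. As an illustration of how such throughput figures are typically measured (the paper's exact protocol is not specified in this excerpt, and end-to-end FPS would also include pre- and post-processing), the following PyTorch sketch times averaged forward passes after a warm-up; `model` is a placeholder for a loaded network:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, size, warmup=10, iters=100, device="cpu"):
    """Average single-image inference FPS at a given square input size."""
    model.eval().to(device)
    x = torch.randn(1, 3, size, size, device=device)
    for _ in range(warmup):   # warm-up passes are excluded from timing
        model(x)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    return iters / (time.perf_counter() - start)

# e.g. for s in (320, 416, 640): print(s, measure_fps(model, s))
```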
| Model | Input Size | FPS (CPU) |
|---|---|---|
| Pruned YOLO-TN-640 | 640 × 640 | 10.8 |
| Pruned YOLO-TN-416 | 416 × 416 | 20.4 |
| Pruned YOLO-TN-320 | 320 × 320 | 28.6 |
| Unpruned YOLO-TN | 640 × 640 | 8.9 |
| YOLO-V5s | 640 × 640 | 2.4 |
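For the Jetson TX2 system, the network is executed through the MNN inference engine. The sketch below runs a single forward pass using MNN's legacy Python session API; the model filename and input size are placeholders, and a deployed system would add camera capture, preprocessing, and decoding of the detection outputs (typically via the C++ API on-device):

```python
import numpy as np
import MNN  # pip install MNN

# Load a converted model and create an inference session.
interpreter = MNN.Interpreter("yolo_tn_416.mnn")   # placeholder model path
session = interpreter.createSession()
input_tensor = interpreter.getSessionInput(session)

# Stand-in for a preprocessed camera frame (NCHW, float32, normalized).
frame = np.random.rand(1, 3, 416, 416).astype(np.float32)
tmp = MNN.Tensor((1, 3, 416, 416), MNN.Halide_Type_Float,
                 frame, MNN.Tensor_DimensionType_Caffe)
input_tensor.copyFrom(tmp)

interpreter.runSession(session)
output = interpreter.getSessionOutput(session)
print(len(output.getData()))  # raw detection head output; decoded to boxes downstream
```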