Article

YOLO-PWSL-Enhanced Robotic Fish: An Integrated Object Detection System for Underwater Monitoring

College of Mechanical and Electrical Engineering, Chengdu University of Technology, Chengdu 610059, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7052; https://doi.org/10.3390/app15137052
Submission received: 21 May 2025 / Revised: 19 June 2025 / Accepted: 19 June 2025 / Published: 23 June 2025

Abstract

In recent years, China has been promoting aquaculture, but extensive water pollution caused by production activities and climate change has resulted in losses exceeding 4.6 × 10⁷ kg of aquatic products. Widespread water pollution from production activities is therefore a key issue that the aquaculture industry needs to address, and dynamic monitoring of water quality together with fish-specific management is critical to the growth of fry. Here, a low-cost, compact bionic robotic fish with real-time monitoring capability, based on YOLO-PWSL (PConv, Wise-ShapeIoU, and LGFB), is proposed to achieve intelligent control of aquaculture. The bionic robotic fish uses a caudal fin for propulsion and adaptive buoyancy control for precise depth regulation. It is equipped with several types of sensors and wireless transmission equipment, which enables managers to monitor water parameters in real time. It also runs YOLO-PWSL, an improved underwater fish identification model based on YOLOv5s, which integrates three key enhancements: a multilevel attention fusion block (LGFB) that enhances perception in complex scenarios, the Wise-ShapeIoU loss function to optimize the accuracy of the detected bounding boxes, and a lightweight convolution method, PConv, to reduce the parameters and FLOPs of the model. The experimental results show that it exhibits excellent performance compared with the original model: the mAP@0.5 (mean average precision at the 0.5 IoU threshold) of the improved model reached 96.1%, the number of parameters and the amount of computation were reduced by 1.8 M and 3.1 G, respectively, and missed detections were effectively reduced. In the future, the system will facilitate the monitoring of water quality, fish species, and fish behavior, thereby improving the efficiency of aquaculture.

1. Introduction

The ocean has rich natural resources and a suitable biological environment [1]. In recent years, the rapid increase in population has posed a huge challenge to global food security [2]. In this context, China has vigorously promoted aquaculture and constructed deep-sea net cages [3,4], but extensive water pollution caused by production activities and climate change has resulted in losses exceeding 4.6 × 10⁷ kg of aquatic products [5]. Therefore, both real-time monitoring of the water environment and accurate feeding are significant challenges in precision aquaculture [6,7]. In this regard, underwater robots play an important role in aquaculture through timely information acquisition, real-time monitoring, and underwater work. However, their use is associated with several challenges [8]: implementation costs are high [9,10], the robots may significantly disturb aquatic ecosystems, and stringent technical requirements must be met in terms of real-time monitoring and data accuracy [11,12]. As underwater robots are applied in an increasing number of fields, and with the development of bionic robotics in recent years, the bionic robotic fish has emerged [13,14]. Its advantages include low intrusiveness and high flexibility [15]. It has therefore been used widely for real-time monitoring and data collection in water, where it plays an important role [16,17]. However, improving the biomimetic fidelity of underwater robots tends to reduce their operational capability to some extent. Combining biomimetic fidelity with underwater operation capability therefore remains a challenge in the use of bionic robotic fish and is a key issue for improving aquaculture efficiency [18].
As underwater robots are applied in an increasing number of fields, extensive research on them has been conducted both domestically and internationally. Chen Xiaojun, Li Dejin et al. designed and developed a 3D-printed bionic robotic fish for aquaculture; however, it could not independently complete underwater operations [19]. Zhao Donglei et al. proposed a large-scale bionic yellow croaker robot model, analyzed the hydrodynamic resistance of the model with respect to body shape, body length, and material, and found that the effect of different skin materials on water resistance varied greatly at low water velocity [20]. Huang et al. found that robotic fish of different colors were differently attractive to carp, and that white could effectively reduce the disturbance to fish schools [21]. Ivan Tanev of the Graduate School of Science and Engineering, Doshisha University, proposed a bionic fish robot with two joints, which dramatically improved the robot’s speed and energy efficiency [22]. In terms of aquaculture monitoring systems, Md. Naymul Islam Nayoun designed an IoT system that can automatically measure parameters such as pH, temperature, and oxygen level, but it could not measure the various parameters in a unified way and suffered from measurement errors [23]. Misbahuddin et al. proposed a LoRaWAN-based aquaculture monitoring technique using the Kalman filter that effectively improves data accuracy, but the system is costly [24]. Altaf Hussain et al. proposed an energy-efficient routing protocol for marine underwater sensor networks that can effectively avoid path loss [25].
In terms of visual perception technology, underwater robots rely on computer vision algorithms to achieve target detection and recognition. In target detection, deep learning methods are primarily divided into two categories: one-stage and two-stage. One-stage algorithms primarily include SSD [26] and the YOLO [27] series, while representative two-stage algorithms include R-CNN [28], Fast R-CNN [29], and Faster R-CNN [30]. The original YOLO algorithm achieves end-to-end unified prediction by transforming object detection into a regression problem, and subsequent research has further optimized its underwater adaptability [31]. For example, in 2024, Yan et al. from Harbin Engineering University proposed a real-time underwater fish detection and recognition algorithm based on a lightweight CBAM-YOLO network, which effectively identified and detected six different fish species; however, its computational requirements are high, which makes it unsuitable for real-time underwater target detection tasks [32]. In 2021, Ranjith Dinakaran et al. proposed an algorithmic model combining DCGAN + SSD + PSO to improve overall model performance for undersea biodiversity data collection; however, the underwater background changes due to body folding, occlusion, and varied lighting conditions, and the dataset did not fully take this into account [33]. In 2024, the Southern Marine Science and Engineering Guangdong Laboratory proposed Underwater-Yolo, an underwater target detection network with dilated deformable convolutions and a dual-branch occlusion attention mechanism that performs well in enclosed and complex underwater environments, but the model still performs poorly in high-density target and occlusion scenarios [34]. Tong et al. proposed a dead fish identification model based on YOLOv5s, which improved detection accuracy by 4.5% compared with the baseline model, but there has been little research on the identification of dead fish at night or in low-light conditions [35]. The integration of visual perception and bionic machines will drive future development in underwater robotics [36,37]. For example, Liu Yang et al. from China Agricultural University designed a new bionic manta ray for sea cucumber identification, achieving 94.5% detection accuracy, but its underwater operating capabilities are limited [38].
These studies indicate that existing systems still have significant shortcomings in target detection in complex environments (such as dense, obstacle-filled scenes) and in multi-functional integration, and they are highly susceptible to environmental interference. Therefore, building on previous work, our research focuses on enhancing the target detection algorithm’s recognition capability in complex environments and expanding the functional diversity of robotic fish. Here, we propose a low-cost underwater biomimetic fish robot system capable of real-time detection and object recognition. Its objective is to detect water quality parameters in real time and to adapt to the environmental conditions required by different fish species. This system lays the foundation for real-time monitoring in aquaculture.
This study proposes the following improvements:
  • We design an attention fusion block LGFB (Local–Global Fusion Block), which improves perception in complex scenes by combining local and global attention branches with feature processing.
  • We introduce the Wise-IoU loss function and embed ShapeIoU within it, yielding a loss computation method that combines target shape features with dynamically adjusted bounding box regression and thereby optimizes the accuracy of the detection boxes.
  • We introduce a lightweight convolutional PConv, which enhances feature extraction by relying only on valid pixels in the computation process and effectively solves the missing data problem. Through experimental analysis, the introduction of lightweight convolutional PConv not only reduces the number of model parameters but also improves the accuracy.
  • We design a sinking and floating system that controls the stability of the bionic fish in water using a PID algorithm combined with a depth sensor, and equip the fish with temperature, turbidity, and depth sensors as well as a foldable robotic arm.

2. Materials and Methods

2.1. Design of a Bionic Mechanical Fish

2.1.1. Overall Structure of the Bionic Fish

Figure 1 shows the 3D model, cross-sectional view, and cabin body view of the biomimetic mechanical fish created using SolidWorks (2023 version). The main body of the cabin is made of acrylic, with the power module, control components, and sensor module integrated inside it. To control the lifting and hovering of the biomimetic robotic fish, two built-in syringe-type water-injection units driven by two DC motors are incorporated within the fish, and precise control is achieved by combining depth sensor feedback with PID compensation. To reduce disturbance to the aquatic environment and wildlife, a tail fin module composed of two servo motors is used for movement control, which causes less disturbance than small propellers. Given the need for lightweight, high-strength, and ductile materials in underwater environments, a thermoplastic resin was used for 3D printing the modules. The external tail fin, body, and pectoral fin modules were modelled in SolidWorks and 3D printed on a Tuozhu P1 (Shenzhen Tuozhu Technology Co., Ltd., Shenzhen, China) using PLA+ (material density: 1.3 g/cm³, filament diameter: 1.75 ± 0.02 mm, impact strength: 67.5 ± 4 kJ/m²). The mechanical fish has a total length of 730.17 mm, a total width of 186.91 mm, and a total height of 156.22 mm. The tail section measures 333.55 mm in length, 147.09 mm in height, and 156.22 mm in width; the body section measures 144.93 mm in length, 183.37 mm in width, and 93.69 mm in height; and the pectoral fins measure 77.88 mm in length, 89.43 mm in width, and 6.92 mm in height.
The compartment contains three sensors, a wireless transmission module, an image module, a main control chip, and a control module for the fish tail. Figure 2a shows the compartment loaded with the fish shell, and Figure 2b shows a three-dimensional model of the interior of the compartment, displaying the locations of each module.

2.1.2. Bionic Fish Composition Modules

The bionic fish modules are shown below in Figure 3. We will explain the functions of each module in detail.
(a)
STM32H7B0VBT6 System Board
This method uses the STM32H7B0VBT6 (Geek STM32 Electronic Development Network, Shenzhen, China) chip to achieve data exchange and coordination between modules, driving each module to work closely together. Water pressure sensors and PID algorithms are used to control the reservoir module for buoyancy, and the power module is used for movement. Data is obtained in real time through three sensors and transmitted to the host computer in real time through the transmission module.
(b)
Power system
The tail fin section consists of two servos, which are driven to provide power. This module achieves different states by adjusting the duty cycle of the servo motor. Acceleration, deceleration, and swinging are achieved through the coordination of the working states of the two servo motors.
(c)
Reservoir module
This module controls the proportion of water storage through two needle tubes controlled by DC motors. The system’s buoyancy is controlled through feedback from a depth sensor and PID algorithm regulation.
(d)
Water pressure sensor
The model number of this water pressure sensor is XGZP6847A (China Jinhua Tugeli Technology Co., Ltd., Jinhua, China). It is installed inside the machine and connected to the outside of the machine via a hose. It is used to measure the current water pressure in order to determine the depth of the machine underwater. The sensor is connected to the STM32H7B0VBT6 chip, providing corresponding water pressure information.
(e)
Turbidity sensor
This module is a TDS sensor that collects data synchronously through two channels, offering high real-time performance. The chip is located inside the machine, while the probe is located outside the machine. The sensor is connected to the STM32H7B0VBT6 chip, providing corresponding water quality information.
(f)
Temperature sensor
This module is a DS18B20 sensor. The sensor is connected to the STM32H7B0VBT6 chip, providing corresponding water temperature information.
(g)
Image transmission module
The image transmission module is installed on the head of the underwater intelligent robot and connects directly to the host computer via Wi-Fi. The image transmission module can transmit stable underwater images to the host computer.
(h)
Communication module
This module is a Bluetooth module that transmits data collected by sensors to the host computer in real time, while receiving corresponding information from the host computer.
(i)
Robotic arm
The robotic arm consists of four servo motors, which can control its bending, folding, picking, and releasing.

2.2. Tail Fin Swing Design

2.2.1. Tail Fin Drive Design

The tail fin is the core design element of the bionic robotic fish. Through relevant concepts and force analysis, the motion of the caudal fin can be decomposed by the superposition of multiple waves to perform a stable cruising process [39]. Inspired by this, a bionic fish caudal fin scheme was designed in this study, as shown in Figure 4, where Wave I propagates from the posterior end of the body to the mid-caudal fin, while Wave II propagates from the mid-caudal fin to the tail. By coordinating these two sets of waves, the underwater robotic fish achieves efficient and agile movements. This allows the robotic fish to move ahead. If we change the amplitude and direction of the swing, the spatial thrust generated will satisfy the complex conditions necessary for the robot fish to turn or move forward.
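To make the wave-superposition idea concrete, the following minimal Python sketch superposes two travelling waves (standing in for Wave I and Wave II) to give an illustrative tail deflection profile. The amplitudes, wave numbers, frequency, and phase offset are hypothetical and are not taken from this paper; the actual actuation signals used on the robot are the triangular waves of Section 2.2.3.

```python
import numpy as np

def caudal_deflection(s, t, A1=0.15, A2=0.10, k1=2.0, k2=4.0, omega=2.0 * np.pi, phi=np.pi / 4):
    """Illustrative superposition of two travelling waves along the tail.

    s : normalized position along the tail (0 = body/tail joint, 1 = tail tip)
    t : time in seconds
    All amplitude, wave-number, and frequency values here are hypothetical.
    """
    wave1 = A1 * np.sin(k1 * s - omega * t)        # "Wave I": rear body to mid caudal fin
    wave2 = A2 * np.sin(k2 * s - omega * t + phi)  # "Wave II": mid caudal fin to tail tip
    return wave1 + wave2

# Deflection profile of the tail at t = 0.1 s
s = np.linspace(0.0, 1.0, 20)
print(caudal_deflection(s, 0.1))
```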

2.2.2. Control Design

When operating underwater, an important indicator of working capacity is the ability to maintain depth stability. Therefore, the proposed method controls the water volume in the built-in syringes via two DC motors, combined with depth sensor feedback and PID compensation, to keep the underwater depth stable. The design objective is to allow the system to return rapidly to the target depth even when subjected to fluctuating disturbances. The depth sensor feedback is used to adjust the DC motors with appropriately tuned PID parameters: the further the fish is from the target depth, the faster the adjustment; the closer it is, the slower the adjustment. This enables rapid and accurate depth regulation.
(1)
Depth feedback:
Fitting function linearity parameter: GapValue
Atmospheric pressure measurement value: pressure
Water depth value: depth
The depth values were calculated as follows:
$\mathrm{Depth} = \mathrm{Depth} - \mathrm{Pressure}$
$\mathrm{Depth} = \dfrac{\mathrm{Depth} \times 100}{\mathrm{GapValue}}$
(2)
PID mathematical model transfer function:
$u(k) = k_p e(k) + k_i \sum_{j=0}^{k} e(j) + k_d \left[ e(k) - e(k-1) \right]$
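For clarity, the following Python sketch combines the depth conversion above with the discrete PID law to drive the ballast motors. It is a minimal sketch: the gains, target depth, and sensor values are hypothetical and would need tuning on the actual hardware.

```python
class DepthPID:
    """Positional PID for depth hold: u(k) = kp*e(k) + ki*sum(e) + kd*(e(k) - e(k-1))."""

    def __init__(self, kp, ki, kd, target_depth_cm):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target = target_depth_cm
        self.err_sum = 0.0
        self.prev_err = 0.0

    @staticmethod
    def depth_from_pressure(raw_pressure, atmospheric, gap_value):
        """Convert the raw pressure reading to a depth value using the linearity parameter GapValue."""
        depth = raw_pressure - atmospheric      # remove the atmospheric-pressure offset
        return depth * 100.0 / gap_value        # scale by 100 / GapValue

    def update(self, raw_pressure, atmospheric, gap_value):
        depth = self.depth_from_pressure(raw_pressure, atmospheric, gap_value)
        err = self.target - depth
        self.err_sum += err
        u = self.kp * err + self.ki * self.err_sum + self.kd * (err - self.prev_err)
        self.prev_err = err
        return u  # drive signal for the two DC motors (sign selects fill or drain)

# Hypothetical gains and sensor values; the real values would be tuned on the robot.
pid = DepthPID(kp=2.0, ki=0.05, kd=0.5, target_depth_cm=50.0)
motor_cmd = pid.update(raw_pressure=1130.0, atmospheric=1013.0, gap_value=230.0)
```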

2.2.3. Power Control

The tail fin power module consists of two joints, which are controlled by two servos to generate the corresponding driving forces. Each servo is driven by a ‘triangular’ wave control signal with a sampling interval of 5 ms, and the instantaneous values of this signal, α1 and α2, are given in the following Equations (3) and (4):
$\alpha_1 = \begin{cases} A_1 - \dfrac{400}{T}t, & 0 \le t \le \dfrac{T}{2} \\ B_1 + \dfrac{400}{T}t, & -\dfrac{T}{2} \le t \le 0 \end{cases}$
$\alpha_2 = \begin{cases} A_2 - \dfrac{1000}{T}t, & 0 \le t \le \dfrac{T}{2} \\ B_2 + \dfrac{1000}{T}t, & -\dfrac{T}{2} \le t \le 0 \end{cases}$
where $T$ is the period, $A_1$ is the maximum value of $\alpha_1$, $B_1$ is the minimum value of $\alpha_1$, $A_2$ is the maximum value of $\alpha_2$, and $B_2$ is the minimum value of $\alpha_2$. The rising or falling edges of the triangular wave correspond linearly to the increase or decrease in the PWM duty cycle, eliminating the need for additional interpolation calculations. A triangular wave can be generated automatically by a hardware timer, significantly simplifying the control logic. This avoids the complex real-time trajectory planning required for sine waves, reducing the computational load on the MCU. When the maximum or minimum value is reached, the abrupt transition of the triangular wave is smoothed by optimizing the vertex (adding a 5 ms transition segment).
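The following Python sketch transcribes Equations (3) and (4) directly, folding time into one period and returning the instantaneous joint angles. The amplitude values A and B used in the example are hypothetical, and the 5 ms vertex-smoothing segment is omitted for brevity.

```python
def triangular_alpha(t, T, A, B, slope):
    """Instantaneous joint angle per Equations (3) and (4).

    t     : absolute time (s); folded below into one period centred on zero
    T     : oscillation period (s)
    A, B  : maximum / minimum angle of the joint (hypothetical values in the example)
    slope : 400 for alpha_1, 1000 for alpha_2 (constants from the equations)
    """
    t = ((t + T / 2.0) % T) - T / 2.0          # fold t into [-T/2, T/2]
    if t >= 0.0:                               # 0 <= t <= T/2
        return A - (slope / T) * t
    return B + (slope / T) * t                 # -T/2 <= t <= 0

# Sample both joints every 5 ms (the control sampling interval) over one period.
T = 1.0
angles = [(triangular_alpha(k * 0.005, T, A=45.0, B=-45.0, slope=400.0),
           triangular_alpha(k * 0.005, T, A=30.0, B=-30.0, slope=1000.0))
          for k in range(int(T / 0.005))]
# Each (alpha_1, alpha_2) pair would be mapped linearly to the PWM duty cycles of the two servos.
```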

2.2.4. Circuit Design

To avoid short circuits and to install the sensors in suitable positions for measurement, we designed the internal circuit layout of the bionic fish shown in Figure 5, which makes rational and efficient use of the space inside the cabin and supports more complex functions. As shown in Figure 5, a single arrow indicates that a module is controlled or supplied with information, a solid double arrow indicates that the two sides interact via a wired connection, and a dotted double arrow indicates that the two sides interact via a wireless connection.

2.3. YOLO-PWSL

2.3.1. Architecture of YOLO-PWSL

YOLOv5 is an efficient single-stage target detection algorithm, and its lightweight version, YOLOv5s, is widely used in resource-constrained scenarios due to its few parameters and fast inference. Since underwater fish recognition faces challenges such as variable target scales, complex ambient lighting, and strong background interference, targeted improvements for underwater scene characteristics are needed when using the generic YOLOv5s model for detection. In this study, we chose YOLOv5s, which balances computational efficiency and accuracy, as the baseline model, and the motivation for improvement was to solve the three core problems of underwater small-target fish: shape diversity adaptation, fuzzy target feature extraction, and computational resource constraints. To this end, we used the Wise-ShapeIoU loss function to optimize the bounding box regression in the model, adopted the LGFB module to enhance the multi-scale feature fusion capability, and introduced PConv to achieve lightweight computation. The improved network can effectively detect various types of fish targets in underwater images, including occluded individuals, fuzzy targets, and group-dense scenes. The improved architecture is shown in Figure 6. We replaced the original C3 layer with PConv and added three LGFB modules to perform multi-level feature fusion. When these modules are used together, Wise-ShapeIoU improves the localization accuracy of irregular fish bodies, LGFB captures key features through local–global attention, and PConv reduces the number of parameters while maintaining the feature expression capability.

2.3.2. LGFB (Local–Global Fusion Block) Module

In computer vision tasks, multi-level feature fusion and attention mechanisms have been widely shown to significantly improve a model’s perception of complex scenes. However, existing methods often struggle to coordinate local and global information effectively during feature fusion. For this reason, this paper proposes the LGFB (Local–Global Fusion Block) module, which aims to achieve the adaptive fusion of local and global features, simultaneously capturing fish-body details and the complex underwater environment so as to improve the detection efficiency of the model. The structure of LGFB is shown in Figure 7. The working principle of LGFB can be divided into preprocessing, hierarchical attention branching, and feature fusion. The two input features are first passed through 1 × 1 convolutions to reduce the computational complexity, and then three parallel operations are performed: local attention captures the detailed features of the fish, global attention handles the complex underwater environment, and LocalGlobalAttention combines the two attention mechanisms for each of the two input features. Each branch is then processed by a 3 × 3 convolution for feature extraction, and the resulting parts are concatenated. Finally, a 1 × 1 convolution reduces the computational complexity, and a reparameterized convolution performs feature reorganization to improve parameter utilization. This multilevel feature fusion is well suited to underwater fish detection, as fish may present different scales and morphologies at different depths and distances. Its ability to adaptively coordinate local and global information ensures that the model captures both the detailed features of the fish body and the overall distribution of the fish and background environment. This adaptive fusion mechanism helps to improve the accuracy and robustness of detection.
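As a rough illustration only, the PyTorch sketch below follows the textual description of LGFB (1 × 1 reduction, parallel local and global attention, 3 × 3 fusion, concatenation, 1 × 1 projection). The actual block is defined by Figure 7, so the attention branches here are generic placeholders and the reparameterized convolution is replaced by a plain 1 × 1 convolution; the sketch also assumes the two input feature maps share the same spatial size.

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Simplified stand-in for the local/global attention; the exact design is given in Figure 7."""
    def __init__(self, channels):
        super().__init__()
        # local branch: depthwise 3x3 conv gates fine fish-body detail
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Sigmoid(),
        )
        # global branch: global average pooling summarises the scene context
        self.global_ = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.local(x) + x * self.global_(x)

class LGFB(nn.Module):
    """Sketch of the Local-Global Fusion Block: 1x1 reduce -> attention -> 3x3 fuse -> concat -> 1x1."""
    def __init__(self, c1, c2, c_out):
        super().__init__()
        c_mid = c_out // 2
        self.reduce1 = nn.Conv2d(c1, c_mid, 1)
        self.reduce2 = nn.Conv2d(c2, c_mid, 1)
        self.attn1 = LocalGlobalAttention(c_mid)
        self.attn2 = LocalGlobalAttention(c_mid)
        self.fuse1 = nn.Conv2d(c_mid, c_mid, 3, padding=1)
        self.fuse2 = nn.Conv2d(c_mid, c_mid, 3, padding=1)
        self.project = nn.Conv2d(2 * c_mid, c_out, 1)  # stands in for the reparameterized conv

    def forward(self, x1, x2):
        x1 = self.fuse1(self.attn1(self.reduce1(x1)))
        x2 = self.fuse2(self.attn2(self.reduce2(x2)))
        return self.project(torch.cat([x1, x2], dim=1))
```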

2.3.3. Wise-ShapeIoU Loss Function

(1)
ShapeIoU
ShapeIoU is an IoU calculation method that incorporates the shape characteristics of the target and is designed for irregularly shaped targets [40]. The traditional IoU only considers the overlapping region of the bounding boxes and ignores the shape characteristics of the target, which may lead to inaccurate evaluation of irregularly shaped targets. In contrast, ShapeIoU introduces shape information on top of the traditional IoU, improving the consideration of the target’s shape and geometric structure and the accuracy of the bounding box regression. In the equations below, scale is the scaling factor, which is related to the size of the targets in the dataset; ww and hh denote the weighting coefficients in the horizontal and vertical directions, respectively, and their values are related to the shapes of the GT boxes. Underwater fish populations are often irregular in shape and arrangement, and a bounding box may cover multiple objects of different shapes, not just regular rectangles. With the traditional IoU loss function, it is difficult to obtain accurate matches for such irregular target shapes, whereas ShapeIoU can better handle these irregular shapes and morphological changes, particularly the complex structure of different individuals in a fish school, by considering the geometry of the target and the alignment of the bounding box. The formulas for ShapeIoU are as follows:
$ww = \dfrac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}$
$hh = \dfrac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}$
$distance^{shape} = hh \times \dfrac{(x_c - x_c^{gt})^2}{c^2} + ww \times \dfrac{(y_c - y_c^{gt})^2}{c^2}$
$\Omega^{shape} = \sum_{t = w, h} \left(1 - e^{-\omega_t}\right)^{\theta}, \quad \theta = 4$
$L_{ShapeIoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}$
(2)
Wise-IoU
Wise-IoU is a dynamic, non-monotonically focused IoU optimization method that improves evaluations in the presence of complex and inhomogeneous backgrounds [41]. Compared with the traditional IoUs, Wise-IoU improves the detection performance of the model in various complex scenarios by weighting different parts of the IoU and dynamically adjusting the weighting mechanism. In underwater environments, fish are often disturbed by factors such as water currents, light changes, and marine impurities, which may lead to complex background information. Individual fish sometimes obscure or overlap each other, which results in blurred boundaries between the targets. The Wise-IoU loss function can effectively address these challenges by increasing the focus on important regions (e.g., regions with more occlusion or overlap) based on the feedback of the model during the training process, which improves the accuracy of target detection in these complex background and occlusion situations. The formula for Wise-IoU is as follows:
$L_{WIoUv1} = R_{WIoU}\, L_{IoU}$
$R_{WIoU} = \exp\!\left(\dfrac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2}\right)$
$L_{WIoUv2} = L_{IoU}^{\,r}\, L_{WIoUv1}, \quad r > 0$
$L_{WIoUv2} = \left(\dfrac{L_{IoU}}{\overline{L_{IoU}}}\right)^{r} L_{WIoUv1}$
(3)
Wise-ShapeIoU
Wise-ShapeIoU combines the weighting mechanism of Wise-IoU and the shape feature mechanism of ShapeIoU, thus further enhancing the ability to evaluate target detection performance. Underwater fish are usually densely distributed, and the targets are easily occluded from each other in varying amounts. Through a weighting mechanism, a higher weight is assigned to a few categories of targets to improve the model’s detection ability for a few fish, and the weights of overlapping regions are dynamically adjusted to reduce the impact of occlusion on the detection results. Combined with shape features, dense targets can be separated more accurately to improve the detection accuracy and optimize the model’s ability to adapt to changes in fish postures.
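To make the formulas above concrete, the Python sketch below implements the ShapeIoU terms and then re-weights them with the Wise-IoU focusing factor R_WIoU. Two caveats: the ω_w/ω_h terms inside Ω^shape are not spelled out in the text and follow the cited Shape-IoU paper [40], and the exact rule by which Wise-IoU and ShapeIoU are combined is not given explicitly, so the final re-weighting line is one plausible reading rather than the authors' implementation.

```python
import torch

def wise_shape_iou_loss(pred, gt, scale=0.0, eps=1e-7):
    """Sketch of a Wise-ShapeIoU-style loss built from the formulas above.

    pred, gt: tensors of shape (N, 4) holding boxes as (x1, y1, x2, y2).
    """
    px1, py1, px2, py2 = pred.unbind(-1)
    gx1, gy1, gx2, gy2 = gt.unbind(-1)

    w, h = px2 - px1, py2 - py1                      # predicted width / height
    wg, hg = gx2 - gx1, gy2 - gy1                    # ground-truth width / height
    xc, yc = (px1 + px2) / 2, (py1 + py2) / 2        # predicted centre
    xcg, ycg = (gx1 + gx2) / 2, (gy1 + gy2) / 2      # ground-truth centre

    # plain IoU
    inter = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(0) * \
            (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(0)
    iou = inter / (w * h + wg * hg - inter + eps)

    # smallest enclosing box: sides and squared diagonal c^2
    cw = torch.max(px2, gx2) - torch.min(px1, gx1)
    ch = torch.max(py2, gy2) - torch.min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # shape weights ww, hh
    ww = 2 * wg ** scale / (wg ** scale + hg ** scale)
    hh = 2 * hg ** scale / (wg ** scale + hg ** scale)

    # shape-weighted centre distance
    distance_shape = hh * (xc - xcg) ** 2 / c2 + ww * (yc - ycg) ** 2 / c2

    # shape penalty Omega^shape with theta = 4 (omega terms per the Shape-IoU paper [40])
    omega_w = hh * (w - wg).abs() / torch.max(w, wg)
    omega_h = ww * (h - hg).abs() / torch.max(h, hg)
    omega_shape = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    shape_iou = 1 - iou + distance_shape + 0.5 * omega_shape

    # Wise-IoU focusing factor R_WIoU; enclosing-box terms detached so it only re-weights the loss
    r_wiou = torch.exp(((xc - xcg) ** 2 + (yc - ycg) ** 2) /
                       (cw.detach() ** 2 + ch.detach() ** 2 + eps))
    return r_wiou * shape_iou   # assumed combination rule
```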

2.3.4. PConv Module

(1)
Conv
The conventional convolution operation is a basic operation in convolutional neural networks. The basic principle is to apply one or more convolutional kernels to the input image or feature map to perform a sliding-window operation to extract local features. As shown in Figure 8, assume that the input layer is a 3-channel color image. After passing through a convolution layer containing 4 filters, the final output is 4 feature maps with the same dimensions as the input layer. At this point, the convolution layer has a total of 4 filters, each filter contains 3 kernels, and each kernel is 3 × 3 in size. Therefore, the number of parameters in the convolution layer is 108. At each position, the convolution kernel performs an element-by-element multiplication and summation operation with the local region of the input. For each output channel, a separate convolution kernel is used to perform a convolution operation with the corresponding regions of all the input channels to obtain an output channel. In conventional convolution, each convolution kernel is convolved with each local region across all input channels, making the convolution operation computationally intensive.
(2)
DWConv
DWConv consists of two parts: depthwise convolution and pointwise convolution (Figure 9). Again taking a 3-channel color image as input, depthwise (channel-wise) convolution is performed first: three single-channel 3 × 3 kernels perform convolution separately, each producing one single-channel feature map, so the output consists of three feature maps. Each filter contains only a single 3 × 3 kernel, so this stage has 27 parameters. Next, pointwise convolution uses four 1 × 1 × 3 kernels to convolve the feature maps obtained in the previous step, producing four new feature maps; this stage has 12 parameters. The total number of parameters is therefore 39. Depthwise convolution uses a separate kernel for each input channel instead of applying the same kernel across all input channels, which significantly reduces the number of parameters. Pointwise convolution applies 1 × 1 kernels to the feature maps produced by the depthwise stage and is usually used to combine information across channels, increasing the number of output channels; it linearly combines the per-channel feature maps generated by the depthwise convolution.
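The parameter counts quoted above can be checked directly in PyTorch (bias terms are disabled so that only kernel weights are counted):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Standard convolution: 3 input channels, 4 output channels, 3x3 kernels -> 4 * 3 * 3 * 3 = 108
conv = nn.Conv2d(3, 4, kernel_size=3, bias=False)

# Depthwise separable convolution:
#   depthwise: one 3x3 kernel per input channel -> 3 * 3 * 3 = 27
#   pointwise: four 1x1x3 kernels               -> 4 * 3     = 12
dw = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)
pw = nn.Conv2d(3, 4, kernel_size=1, bias=False)

print(n_params(conv))               # 108
print(n_params(dw) + n_params(pw))  # 39
```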
(3)
PConv:
The core idea of PConv is to perform better restoration by considering only valid pixel regions in a convolution operation and ignoring invalid regions (e.g., missing parts of images) [42]. In a standard convolution operation, the convolution kernel processes the entire input region, regardless of which regions are valid, and which are missing. In contrast, the convolution operation of PConv distinguishes between valid and missing pixels by introducing a mask. When convolution is performed, only the valid portion of the mask is used to compute the convolutional output. Therefore, the convolution kernel of PConv only weighs and sums the valid pixel regions, and the missing regions do not affect the convolution result.
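The following is a minimal sketch of the mask-based partial convolution described above, not the authors' implementation: the input is multiplied by a validity mask, the response is re-normalized by the number of valid pixels in each window, and the mask is updated for the next layer.

```python
import torch
import torch.nn.functional as F

def partial_conv2d(x, mask, weight, bias=None, eps=1e-8):
    """Minimal sketch of mask-based partial convolution.

    x     : input features, shape (N, C, H, W)
    mask  : binary validity mask, shape (N, 1, H, W); 1 = valid pixel, 0 = missing
    weight: convolution kernels, shape (C_out, C, k, k)
    """
    k = weight.shape[-1]
    pad = k // 2

    # convolve only the valid (masked) part of the input
    out = F.conv2d(x * mask, weight, bias=None, padding=pad)

    # count valid pixels in each window and re-normalise the response
    ones = torch.ones_like(weight[:1, :1])        # (1, 1, k, k) summing kernel
    valid = F.conv2d(mask, ones, padding=pad)     # valid-pixel count per window
    out = out * (k * k / (valid + eps))
    out = out * (valid > 0).float()               # windows with no valid pixel stay zero

    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)

    new_mask = (valid > 0).float()                # a window with any valid pixel becomes valid
    return out, new_mask
```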

3. Experimental Results and Analysis

3.1. Experimental Environment

The experiment was conducted in a river on the campus of Chengdu University of Technology. The experimental river is a slow-flowing freshwater environment, with a water temperature of 11 °C on the day of the experiment. A static water tank was also used for a corresponding experiment. The training hyperparameter settings are shown in Table 1. The software environment was Python 3.8.20 with the PyTorch (2024.2.3) open-source deep learning framework, and the hardware environment was an NVIDIA GeForce RTX 3050 Laptop GPU (Nvidia, Santa Clara, CA, USA). To better match the actual biological environment, we produced our own dataset for model training; all fish images used in this study were sourced from publicly licensed online resources. All images were manually screened to remove blurry, low-resolution, or copyright-restricted content, resulting in a final dataset of 2898 images that met the quality standards. The dataset included images from various complex environments, such as folded, occluded, and densely populated fish schools and low-light conditions. The dataset was divided into a training set and a test set in an 8:2 ratio. The 2898 images were labelled using the LabelImg tool (v1.8.1), and the dataset contained a total of 3276 labelled boxes and 10 different labels, as shown in Figure 10. In this experiment, mosaic image augmentation was used to expand the dataset and reduce the risk of overfitting.
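As an illustration of the 8:2 split described above, a minimal Python sketch follows; the directory layout and file extension are hypothetical.

```python
import random
import shutil
from pathlib import Path

random.seed(0)
images = sorted(Path("dataset/images").glob("*.jpg"))   # 2898 images, hypothetical layout
random.shuffle(images)

split = int(0.8 * len(images))                           # 8:2 train/test split
for subset, files in (("train", images[:split]), ("test", images[split:])):
    out = Path(f"dataset/{subset}/images")
    out.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy(f, out / f.name)
```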

3.2. Ablation Experiment and Comparative Experiments

3.2.1. Calculation of Indicators

Precision is the ratio of the number of samples correctly predicted by the model for a particular category to the total number of samples predicted by the model for that category. Recall is the proportion of actual positive samples in the validation set that the model correctly predicts as positive. The average precision (AP) combines a model’s precision and recall for a single category, and mAP averages AP over all categories. In this paper, we employed mAP@0.5, i.e., an IoU threshold of 0.5, as the sole evaluation threshold. The formulas used are as follows:
$Precision = \dfrac{TP}{TP + FP}$
$Recall = \dfrac{TP}{TP + FN}$
$mAP = \dfrac{1}{m} \sum_{i=1}^{m} AP_i$
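A small worked example of the three formulas; the TP/FP/FN counts and AP values are invented purely for illustration.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class average precision (AP) values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts for one class: 46 true positives, 4 false positives, 6 false negatives
print(precision(46, 4))                       # 0.92
print(recall(46, 6))                          # ~0.8846
print(mean_average_precision([0.95, 0.93]))   # 0.94
```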

3.2.2. Comparative Analysis

All models were trained on the same custom dataset for a fair comparison. The specific experimental comparisons were as follows:
(1)
We conducted a comparison experiment between LGFB and common attention modules, such as SE (Squeeze-and-Excitation Networks) and CBAM (Convolutional Block Attention Module), and the experimental data are shown in Table 2.
The comparison showed that LGFB displayed better target detection performance compared to the other two modules due to its ability to adaptively fuse local details and global context information.
(2)
We conducted comparative experiments on Conv, DWConv, and PConv, and the results of the experiments are shown in Table 3.
The comparison showed that PConv was able to deal with the problem of missing data efficiently because it focuses on the effective pixels instead of the entire convolution window, which significantly improves the performance and reduces the amount of computation compared to the other two convolutions.
(3)
In order to better validate the effectiveness of the improved model, we compared YOLO-PWSL with three other target detection models: SSD, Faster R-CNN, and YOLOv5s. The test results are shown in Table 4.
The YOLO-PWSL model was compared with SSD, Faster R-CNN, and YOLOv5s. The experimental results demonstrated that the improved YOLO-PWSL model achieved higher mAP@0.5 with lower FLOPs and fewer parameters, showing higher effectiveness in detecting fish within complex backgrounds.
However, its parameter count and computational cost remain relatively high, indicating the need for further lightweight optimization to enhance efficiency.

3.2.3. Independence and Synergy Analysis

We conducted ablation experiments to evaluate the contribution of each of the three modules to improving the model performance. In this paper, PConv, Wise-ShapeIoU, and LGFB were added to the model sequentially. The experimental design and data are shown in Table 5, while a 3D scatter plot of the experimental data is shown in Figure 11.
(a)
Independent Effect Analysis
PConv effectively addresses missing-data issues (e.g., occlusion, turbid backgrounds) by relying solely on valid pixels during computation, enhancing feature extraction and improving computational speed. PConv uses selective convolution without additional branches, which not only avoids increasing FLOPs but significantly reduces them. Adding the PConv module increased mAP@0.5 by 2% while reducing floating-point operations (FLOPs) by 4.2 G. The significant reduction in computational load directly translated into a noticeable increase in inference speed, making this the model with the highest FPS of all the models.
Wise-ShapeIoU optimizes detection box accuracy by dynamically adjusting the loss calculation method for bounding box regression, incorporating target shape characteristics, which resulted in a 1.4% improvement in mAP@0.5. This more complex calculation also needs to be performed during model inference (prediction), which adds a small amount of computational overhead, resulting in a slight decrease in FPS compared to the baseline.
LGFB employs a hierarchical attention mechanism and feature fusion pathways to achieve the adaptive integration of local details and global contextual information, thereby enhancing the perception capability in complex scenes. The core computational overhead of LGFB is concentrated in the dual-branch attention mechanism and feature concatenation and fusion, which significantly increases FLOPs. Adding additional network layers or complex fusion operations increases the computational complexity and number of parameters of the model, resulting in a significant decrease in FPS.
(b)
Synergistic Effect Analysis
By integrating PConv, LGFB, and Wise-ShapeIoU, an improvement of 3% in mAP@0.5 was observed, along with a reduction of 3.1 G in FLOPs. PConv employs partial convolution to effectively filter underwater noise and occlusions, extracting cleaner features. LGFB leverages a local–global attention mechanism to adaptively fuse multi-scale features, enhancing the model’s perception of fish swarm size and distribution. Wise-ShapeIoU optimizes localization accuracy through shape-aware bounding box regression, particularly improving detection in dense and deformed fish swarms.
The acceleration effect of PConv was partially offset by the deceleration effect of LGFB. The final speed was lower than that of Model 2 (PConv alone) but much higher than that of Model 4, which included LGFB, and higher than that of Model 3, whose loss computation adds a small overhead. The full model thus represents a compromise between accuracy and speed.
The three methods work together to optimize anti-interference feature extraction, multi-scale feature fusion, and high-precision localization, enabling the model to achieve higher performance in complex underwater environments with turbidity, occlusion, and multi-scale variations.

3.2.4. Analysis of the Visualization Results

As shown in Figure 12, the training results before and after model improvement were visualized under normal conditions, complex environments, folding, fish aggregation, and dark conditions.
It can be observed that in the normal, complex-environment, folded, and dark conditions, the improved model still showed higher accuracy than the original model, even though both models recognized the objects without missed detections. In the case of fish aggregation and mutual occlusion, the original model produced multiple missed detections, while YOLO-PWSL successfully recognized most of the targets, though a few missed detections remained. This is because the original model cannot handle occlusion and complex environments well, whereas YOLO-PWSL relies on valid pixels, multi-level attention fusion, and shape-aware dynamic adjustment of the loss calculation, which effectively addresses large-area occlusion and significantly reduces missed detections. However, the algorithm still exhibited some missed detections, which need to be addressed in the future. We analyzed the causes and found that they were due to similar backgrounds being identified as the same target and to severe folding occlusion. We believe that further data augmentation or temporal smoothing may help, and this will be a direction for future experiments.

3.3. Bionic Mechanical Fish

3.3.1. Structure and Assembly

In this study, a bionic robotic fish with real-time environment monitoring, target recognition, and underwater operation capabilities is proposed. The robotic fish was integrally shaped using 3D modelling and printing technology to achieve the efficient integration of core components such as the shell, pectoral fins, caudal fin, and caudal fin motion module, and it is innovatively equipped with a foldable, rotating, and retractable robotic arm system, which significantly enhances its adaptability and execution capability in underwater tasks, as shown in Figure 13. In terms of the sensing system, the robotic fish integrates multiple sensor modules for real-time environment detection, transmits the data to the host computer in real time via Bluetooth, and carries a camera module that enables the intelligent recognition of underwater fish species.

3.3.2. System Control

The control system of this bionic robotic fish is mainly divided into four parts. In the depth control layer, a dual DC motor-driven injector system is used, as shown in Figure 14, which, together with the PID control algorithm, achieves high-precision depth adjustment within ±5 cm and can flexibly control the surfacing and diving of the robotic fish. In the propulsion layer, the tail fin swinging frequency is adjusted to achieve an adjustable travel speed of 10–20 cm/s. In the operation layer, a four-servo-driven foldable robotic arm provides 360° omnidirectional rotation and a 15 cm telescopic stroke to meet the needs of complex underwater tasks. This layered control architecture realizes the precise and coordinated control of the robotic fish’s movement. At the image transmission layer, the detection model runs in full, but because other functions are active in actual applications, the frame rate (FPS) is slightly reduced, reaching approximately 54.4 frames per second (average of multiple measurements). The feasibility of the system was verified by a robotic arm response time of less than 0.2 s and an underwater data return delay of less than 0.2 s.

3.3.3. Visual Recognition

Fish recognition is achieved using an improved YOLO-PWSL model, which improves the localization accuracy of irregular fish bodies through Wise-ShapeIoU, local–global attention fusion using LGFB to capture key features, and PConv to reduce the number of parameters while maintaining the feature expressiveness. The improved model can effectively detect fish in complex situations, thereby interacting with the host computer to match the corresponding environmental conditions for each fish species. This enables the robotic fish to reach the appropriate depth for water quality testing.

3.3.4. Water Quality Monitoring Performance

To verify the accuracy of real-time water quality testing, the bionic robotic fish was used to test the water, and the obtained values of water pressure (accuracy: ±1%), temperature (accuracy: ±0.5 °C), and turbidity (error less than 1% F.S., full scale, 500 ppm) were compared with a standard straightedge (resolution: 0.5 mm), a thermometer (resolution: 0.5 °C), and turbidity standard solutions purchased from Guangjian Standard Solution Wholesale; the errors were calculated and taken into account for adjustments. According to the experimental results, the values measured by the bionic fish deviated somewhat from the actual values (Table 6). This error is caused by the accuracy limits of the measurement sensors themselves and by differences in manual estimation. The uncertainty of the temperature measurement is 0.82 °C, and the uncertainty of the depth measurement is 0.60 mm.

4. Discussion

The YOLO-PWSL object detection algorithm proposed in this study employs a local–global feature fusion block (LGFB) to achieve multi-level feature fusion, significantly enhancing the model’s detection capabilities in turbid and occluded environments. The LGFB mechanism leverages local attention to capture detailed features while using global attention to correlate fish bodies with their surrounding environment, thereby suppressing interference from folding and dense scenes. Additionally, the algorithm incorporates shape constraints (ShapeIoU) into Wise-IoU and introduces a dynamic weighting strategy, significantly improving the localization accuracy of irregular fish shapes. By replacing standard convolutions with PConv, which uses only valid pixels for computation, the model achieves higher accuracy while reducing the number of parameters. The experiments in Table 2, Table 3 and Table 4 validate the effectiveness of each improved module, while the ablation experiments in Table 5 validate the improved model as a whole, whose detection accuracy is significantly better than that of the original model. Although the dual-branch attention mechanism and feature concatenation fusion of LGFB slightly increase the computational overhead, their impact on the model’s overall computational cost is limited. More importantly, PConv uses selective convolution without additional branches, which not only avoids increasing FLOPs but significantly reduces them; its application therefore effectively offsets the computational cost introduced by LGFB. As verified in Figure 12, the model effectively improves detection in complex environments with folding and dense fish schools. However, it still requires considerable computational resources, and its various functions affect the overall real-time performance. To achieve lower costs, further lightweight optimization is needed.
This method is used in a biomimetic robotic fish system produced using 3D printing technology. The fish generates thrust through periodic oscillation of a tail fin driven by two servos, thereby reducing interference with fish schools, and different motion modes are achieved by adjusting the duty cycles of the servos. Precise hovering and buoyancy control are achieved through a depth sensor combined with a PID algorithm, providing a dynamically stable image acquisition environment for the YOLO-PWSL algorithm. Additionally, the system is equipped with multiple water quality monitoring sensors and a folding robotic arm, enabling real-time monitoring and operations on water samples. Compared with previous systems, it offers dynamic monitoring of water quality data, more stable hovering, and the ability to match environmental conditions to specific fish species, thereby better addressing water quality pollution in aquaculture. In addition, we will strive to improve the waterproofing of the bionic fish system to achieve higher mechanical durability and sealing performance; for example, ST material, which has better water resistance, could be used.

5. Conclusions

This study proposes a novel underwater monitoring and detection system: the YOLO-PWSL-enhanced robotic fish. We first designed a biomimetic robotic fish equipped with multiple sensors and a folding robotic arm for dynamic water quality monitoring. Second, we designed an underwater hovering method that combines a PID controller with a depth sensor to provide a dynamically stable environment for the system. Finally, we proposed a target detection model, YOLO-PWSL, based on attention mechanism fusion. The experimental results showed that the improved YOLOv5 target detection algorithm achieved a detection accuracy of 96.1% (mAP@0.5), enabling precise identification in complex environments. In practical applications, optimization for occluded scenes can reduce missed detections in dense fish schools, providing an alternative to traditional manual inspections. The experimental results validated the feasibility of this system platform. This work provides a valuable foundation for the future development of more efficient, intelligent, and sustainable water quality monitoring and ecological protection, offering new solutions for intelligent aquaculture.

Author Contributions

Conceptualization, L.L. and W.Z.; data curation, L.L. and H.H.; supervision, L.L. and Y.T.; writing—original draft, L.L. and Q.T.; writing—review and editing, L.L. and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The National College Students Innovation and Entrepreneurship Training Program (202410616037).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest. All images are from the public domain or have been authorized and do not infringe on privacy or copyright. The dataset is for academic research purposes only.

References

  1. Shi, Z.; Xue, D.; Xu, J. Global Marine Product Space and Coastal Countries’ Productive Capabilities, 1995–2021. Land 2025, 14, 378. [Google Scholar] [CrossRef]
  2. Qin, A.; Ning, D. Developments, Applications, and Innovations in Agricultural Sciences and Biotechnologies. Appl. Sci. 2025, 15, 4381. [Google Scholar] [CrossRef]
  3. Liu, Z.; Wang, C.; Guo, B. Biodiversity—The Cornerstone of Sustainable Aquaculture Development: Insights From the Breeding of Approved Fish Varieties for Aquaculture From 1996 to 2024 in China. Rev. Aquac. 2025, 17, e70003. [Google Scholar] [CrossRef]
  4. Wang, Q.; Liu, H.; Sui, J. Mariculture: Developments, present status and prospects. In Aquaculture in China: Success Stories and Modern Trends; Wiley: Hoboken, NJ, USA, 2018; pp. 38–54. [Google Scholar]
  5. Wang, Y.; Zheng, Y.; Qian, X.; Yang, X.; Chen, J.; Wu, W. Aquaculture Wastewater Pollution and Purification Technology in China: Research Progress. J. Agric. 2022, 12, 65–70. [Google Scholar]
  6. Ubina, N.A.; Cheng, S.-C. A Review of Unmanned System Technologies with Its Application to Aquaculture Farm Monitoring and Management. Drones 2022, 6, 12. [Google Scholar] [CrossRef]
  7. Parra, L.; Sendra, S.; Garcia, L.; Lloret, J. Smart Low-Cost Control System for Fish Farm Facilities. Appl. Sci. 2024, 14, 6244. [Google Scholar] [CrossRef]
  8. Ullah, I.; Adhikari, D.; Khan, H.; Anwar, M.S.; Ahmad, S.; Bai, X. Mobile robot localization: Current challenges and future prospective. Comput. Sci. Rev. 2024, 53, 100651. [Google Scholar] [CrossRef]
  9. Huang, Y.-P.; Khabusi, S.P. Artificial Intelligence of Things (AIoT) Advances in Aquaculture: A Review. Processes 2025, 13, 73. [Google Scholar] [CrossRef]
  10. Davis, A.; Wills, P.S.; Garvey, J.E.; Fairman, W.; Karim, M.A.; Ouyang, B. Developing and Field Testing Path Planning for Robotic Aquaculture Water Quality Monitoring. Appl. Sci. 2023, 13, 2805. [Google Scholar] [CrossRef]
  11. Ullah, I.; Ali, F.; Sharafian, A.; Ali, A.; Naeem, H.Y.; Bai, X. Optimizing underwater connectivity through multi-attribute decision-making for underwater IoT deployments using remote sensing technologies. Front. Mar. Sci. 2024, 11, 1468481. [Google Scholar] [CrossRef]
  12. Ma, F.; Fan, Z.; Nikolaeva, A.; Bao, H. Redefining Aquaculture Safety with Artificial Intelligence: Design Innovations, Trends, and Future Perspectives. Fishes 2025, 10, 88. [Google Scholar] [CrossRef]
  13. Ma, S.; Zhao, Q.; Ding, M.; Zhang, M.; Zhao, L.; Huang, C.; Zhang, J.; Liang, X.; Yuan, J.; Wang, X.; et al. A Review of Robotic Fish Based on Smart Materials. Biomimetics 2023, 8, 227. [Google Scholar] [CrossRef] [PubMed]
  14. Singh, N.; Gupta, A.; Mukherjee, S. A dynamic model for underwater robotic fish with a servo actuated pectoral fin. SN Appl. Sci. 2019, 1, 659. [Google Scholar] [CrossRef]
  15. Duraisamy, P.; Kumar Sidharthan, R.; Nagarajan Santhanakrishnan, M. Design, Modeling, and Control of Biomimetic Fish Robot: A Review. J. Bionic Eng. 2019, 16, 967–993. [Google Scholar] [CrossRef]
  16. Wang, J. Robotic Fish: Development, Modeling, and Application to Mobile Sensing; ProQuest: Ann Arbor, MI, USA, 2014. [Google Scholar]
  17. Tong, X.; Tang, C. Design of a monitoring system for robotic fish in underwater environment. Int. J. Veh. Inf. Commun. Syst. 2017, 1, 321. [Google Scholar] [CrossRef]
  18. Costa, D.; Palmieri, G.; Palpacelli, M.-C.; Panebianco, L.; Scaradozzi, D. Design of a Bio-Inspired Autonomous Underwater Robot. J. Intell. Robot. Syst. 2018, 91, 181–192. [Google Scholar] [CrossRef]
  19. Chen, X.; Li, D.; Mo, D.; Cui, Z.; Li, X.; Lian, H.; Gong, M. Three-Dimensional Printed Biomimetic Robotic Fish for Dynamic Monitoring of Water Quality in Aquaculture. Micromachines 2023, 14, 1578. [Google Scholar] [CrossRef]
  20. Zhao, D.; Lu, K.; Qian, W. Hydrodynamic Resistance Analysis of Large Biomimetic Yellow Croaker Model: Effects of Shape, Body Length, and Material Based on CFD. Fluids 2025, 10, 107. [Google Scholar] [CrossRef]
  21. Huang, X.; Zhang, Y.; Chen, X.; Kong, X.; Liu, B.; Jiang, S. Compatibilities of Cyprinus carpio with Varied Colors of Robotic Fish. Fishes 2024, 9, 211. [Google Scholar] [CrossRef]
  22. Tanev, I. Speed and Energy Efficiency of a Fish Robot Featuring Exponential Patterns of Control. Actuators 2025, 14, 119. [Google Scholar] [CrossRef]
  23. Nayoun, M.N.I.; Hossain, S.A.; Rezaul, K.M.; Siddiquee, K.N.e.A.; Islam, M.S.; Jannat, T. Internet of Things-Driven Precision in Fish Farming: A Deep Dive into Automated Temperature, Oxygen, and pH Regulation. Computers 2024, 13, 267. [Google Scholar] [CrossRef]
  24. Misbahuddin, M.; Cokrowati, N.; Iqbal, M.S.; Farobie, O.; Amrullah, A.; Ernawati, L. Kalman Filter-Enhanced Data Aggregation in LoRaWAN-Based IoT Framework for Aquaculture Monitoring in Sargassum sp. Cultivation. Computers 2025, 14, 151. [Google Scholar] [CrossRef]
  25. Hussain, A.; Hussain, T.; Ullah, I.; Muminov, B.; Khan, M.Z.; Alfarraj, O.; Gafar, A. CR-NBEER: Cooperative-Relay Neighboring-Based Energy Efficient Routing Protocol for Marine Underwater Sensor Networks. J. Mar. Sci. Eng. 2023, 11, 1474. [Google Scholar] [CrossRef]
  26. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016. [Google Scholar]
  27. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  28. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv 2013, arXiv:1311.2524. [Google Scholar]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  30. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  31. Khan, S.; Ullah, I.; Ali, F.; Shafiq, M.; Ghadi, Y.Y.; Kim, T. Deep learning-based marine big data fusion for ocean environment monitoring: Towards shape optimization and salient objects detection. Front. Mar. Sci. 2023, 9, 1094915. [Google Scholar] [CrossRef]
  32. Yan, Z.; Hao, L.; Yang, J.; Zhou, J. Real-Time Underwater Fish Detection and Recognition Based on CBAM-YOLO Network with Lightweight Design. J. Mar. Sci. Eng. 2024, 12, 1302. [Google Scholar] [CrossRef]
  33. Dinakaran, R.; Zhang, L.; Li, C.-T.; Bouridane, A.; Jiang, R. Robust and Fair Undersea Target Detection with Automated Underwater Vehicles for Biodiversity Data Collection. Remote Sens. 2022, 14, 3680. [Google Scholar] [CrossRef]
  34. Li, Z.; Zheng, B.; Chao, D.; Zhu, W.; Li, H.; Duan, J.; Zhang, X.; Zhang, Z.; Fu, W.; Zhang, Y. Underwater-Yolo: Underwater Object Detection Network with Dilated Deformable Convolutions and Dual-Branch Occlusion Attention Mechanism. J. Mar. Sci. Eng. 2024, 12, 2291. [Google Scholar] [CrossRef]
  35. Tong, C.; Li, B.; Wu, J.; Xu, X. Developing a Dead Fish Recognition Model Based on an Improved YOLOv5s Model. Appl. Sci. 2025, 15, 3463. [Google Scholar] [CrossRef]
  36. Diamanti, E.; Ødegård, Ø. Visual sensing on marine robotics for the 3D documentation of Underwater Cultural Heritage: A review. J. Archaeol. Sci. 2024, 166, 105985. [Google Scholar] [CrossRef]
  37. Lu, Y.; Chen, X.; Wu, Z.; Yu, J.; Wen, L. A novel robotic visual perception framework for underwater operation. Front. Inf. Technol. Electron. Eng. 2022, 23, 1602–1619. [Google Scholar] [CrossRef]
  38. Liu, Y.; Liu, Z.; Yang, H.; Liu, Z.; Liu, J. Design and Realization of a Novel Robotic Manta Ray for Sea Cucumber Recognition, Location, and Approach. Biomimetics 2023, 8, 345. [Google Scholar] [CrossRef]
  39. Wang, S.; Han, Y.; Mao, S. Innovation Concept Model and Prototype Validation of Robotic Fish with a Spatial Oscillating Rigid Caudal Fin. J. Mar. Sci. Eng. 2021, 9, 435. [Google Scholar] [CrossRef]
  40. Zhang, H.; Zhang, S. Shape-IoU: More Accurate Metric considering Bounding Box Shape and Scale. arXiv 2023, arXiv:2312.17663. [Google Scholar]
  41. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  42. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Figure 1. (a) Bionic robotic fish architecture; (b) top internal cutaway view; (c) left internal cutaway view; (d) main chamber.
Figure 2. (a) Main chamber; (b) main chamber’s internal architecture.
Figure 3. (a) STM32H7B0VBT6 system board; (b) power system; (c) reservoir module; (d) water pressure sensor; (e) turbidity sensor; (f) temperature sensor; (g) image transmission module; (h) communication module; (i) robotic arm.
Figure 4. Tail fin motion model.
Figure 5. Circuit flow diagram.
Figure 6. YOLO-PWSL network structure.
Figure 7. LGFB structure.
Figure 8. Conv structure.
Figure 9. DWConv structure.
Figure 10. Tag information.
Figure 11. Three-dimensional scatter plot of ablation experiment.
Figure 12. Visual analytics structure diagram.
Figure 13. On-site swimming behavior testing.
Figure 14. Sinking system. (a) Modelling of DC motor + syringe; (b) modelling of water storage bin; (c) physical drawing of DC motor + syringe; (d) physical drawing of water storage bin.
Table 1. Hyperparameter settings.
Parameters | Values
Image size | 640 × 640
Batch size | 4
Learning rate | 0.01
Momentum factor | 0.937
Weight decay coefficient | 0.0005
Iterations | 120
Table 2. Comparison of attention mechanisms.
Method | P | R | mAP50
SE | 0.919 | 0.873 | 0.918
CBAM | 0.877 | 0.919 | 0.941
LGFB | 0.915 | 0.903 | 0.948
Table 3. Three convolution comparison experiments.
Method | P | R | mAP50 | GFLOPs
Conv | 0.882 | 0.891 | 0.931 | 24.1
DWConv | 0.921 | 0.862 | 0.935 | 20.3
PConv | 0.914 | 0.924 | 0.951 | 19.9
Table 4. Comparative experiments.
Method | P | R | mAP50 | GFLOPs | Params (M)
SSD | 0.867 | 0.833 | 0.868 | 62.7 | 26.28
Faster R-CNN | 0.732 | 0.912 | 0.906 | 370.2 | 137.09
Yolov5s | 0.882 | 0.891 | 0.931 | 24.1 | 9.11
Ours | 0.929 | 0.897 | 0.961 | 21.0 | 7.31
Table 5. Ablation experiments.
Model | mAP50 | GFLOPs | FPS
M1 | 0.931 | 24.1 | 81.65
M2 | 0.951 | 19.9 | 98.49
M3 | 0.945 | 24.1 | 79.52
M4 | 0.948 | 25.1 | 66.48
M5 | 0.958 | 19.9 | 95.33
M6 | 0.961 | 21.0 | 87.6
M1: Yolov5s. M2: Yolov5s and PConv. M3: Yolov5s and Wise-ShapeIoU. M4: Yolov5s and LGFB. M5: Yolov5s, PConv, and Wise-ShapeIoU. M6: Yolov5s, PConv, Wise-ShapeIoU, and LGFB.
Table 6. Water quality test data.
Sample | Temp. Test (°C) | Temp. Actual (°C) | Temp. Error (°C) | Pressure Test (mm) | Pressure Actual (mm) | Pressure Error (mm) | Turbidity Test (ppm) | Turbidity Actual (ppm) | Turbidity Error (ppm)
1 | 43.00 | 43.24 | 0.24 | 305 | 309 | 4 | 9.897 | 10 | 0.103
2 | 40.47 | 41.12 | 0.65 | 297 | 301 | 4 | 19.905 | 20 | 0.095
3 | 40.78 | 40.41 | 0.37 | 232 | 237 | 5 | 29.841 | 30 | 0.159
4 | 39.46 | 39.50 | 0.04 | 182 | 188 | 6 | 39.885 | 40 | 0.115
5 | 38.54 | 38.61 | 0.07 | 147 | 155 | 8 | 49.877 | 50 | 0.123
6 | 35.23 | 35.43 | 0.20 | 120 | 123 | 3 | 54.844 | 55 | 0.156
7 | 27.13 | 27.27 | 0.14 | 83 | 88 | 5 | 69.881 | 70 | 0.119
8 | 25.02 | 25.34 | 0.32 | 45 | 43 | 2 | 74.859 | 75 | 0.141
9 | 22.69 | 22.93 | 0.24 | 25 | 14 | 11 | 99.87 | 100 | 0.130
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
