1. Introduction
Aquatic organism recognition is increasingly being applied in aquaculture, such as in automated fishing and aquatic organism statistics. In automated fishing, this technology enables systems to accurately identify target species, allowing for selective harvesting. In aquatic organism statistics, recognition technology allows automated systems to monitor population numbers and distribution, ensuring the sustainability of fishery resources. Additionally, some marine biologists have begun using this technology to study the abundance and species diversity of aquatic animals.
However, due to light absorption and scattering, underwater images and videos often exhibit color distortion and blurring of fine details, which reduces the distinguishability of features. Underwater images also tend to contain significant noise, such as tiny algae or particles floating in the water. Moreover, the clustering behavior of aquatic organisms can cause occlusion, leading to missed detections. As a result, accurately identifying aquatic organisms remains a significant challenge.
In recent years, deep learning models have gained increasing popularity in the fields of fisheries and aquaculture, due to their efficient feature learning capabilities and their ability to effectively capture complex patterns. The research in [
1] deepened the model’s architecture and proposed a novel neck structure, embedding a new module. The results indicated that this model could better monitor fishery resources and enhance aquaculture efficiency. Monitoring feed pellet utilization in aquaculture is crucial for effective resource management [
2]. A new framework was introduced that integrates underwater object detection and image reconstruction using YOLO-V5. Addressing the issues of shallow network layers and insufficient feature extraction in the YOLO-V3-Tiny network [
3], a network combining YOLO-V3-Tiny with MobileNet was proposed. Compared to traditional shore-based fishing vessel detection methods, this real-time monitoring approach offers timely warnings to fishing operators, which is more advantageous for fisheries ecological protection. The method [
4] proposed a multi-stage R-CNN (region-based convolutional neural network) target detection network linked by sequential LSTM (long short-term memory), aimed at accurately counting fish in diverse benthic backgrounds and lighting conditions in the ocean. However, the accuracy of this method does not meet practical demands, and its computational complexity increases as the number of cascaded R-CNN structures rises. Another study [
5] integrated deformable convolution modules (DCM) and adaptive thresholding modules (ATM) into the YOLOv5 framework to precisely detect and count red-finned triggerfish in aquaculture ponds. The DCM reduced background blur interference, while the ATM minimized detection misses in densely occluded scenes. The model achieved an accuracy of 97.53% and a recall rate of 98.09% on a real aquaculture pond dataset. In [
6], a single-stage CNN detector, RetinaNet, was used for fish detection in underwater video frames, while a simple online real-time tracker (SORT) was employed to link detections of the same fish, achieving a 74% average precision (AP). Ref. [
7] applied Mask R-CNN and the GOTURN (generic object tracking using regression networks) tracking algorithm to detect and track fish in videos captured by UAVs (unmanned aerial vehicles) above fish tanks, achieving an F1 score of 91% and 16 frames per second (FPS). Lastly, ref. [
8] used three R-CNN architectures based on NASNet, ResNet-101, and MobileNet to detect mullets onshore, and estimated their total length based on per-pixel real length provided by ArUco benchmark markers. To overcome challenges posed by low-quality underwater images and the presence of small-sized feed pellets [
9], an improved YOLO-V4 network was designed to accurately detect uneaten feed pellets in the underwater environments of Atlantic salmon seawater cages.
The aforementioned literature primarily focuses on improving model accuracy but overlooks the design and optimization of models for mobile platforms. One study [
10] proposed a lightweight Underwater-YOLO architecture for real-time fish school detection in unconstrained marine environments, specifically designed for underwater robot embedded systems. Another study [
9] developed a lightweight deep learning architecture, YOLOv3-Lite, for the automatic detection of hunger and oxygen supply behaviors in crucian carp and catfish within fish tanks, utilizing MobileNetv2 as the backbone to simplify detection. A modified YOLOv3 model for fish detection [
11] based on MobileNetV1 was developed to provide real-time detection and counting of red-finned triggerfish in aquaculture ponds, replacing conventional convolution operations in CNNs with depthwise separable convolutions to reduce model parameters. In [
12], an intermediate-layer feature distillation method using the CWD (channel-wise distillation) loss was proposed, with a large YOLOv8s-FG as the teacher network and a small YOLOv8n-FG as the student network. The resulting YOLOv8n-DFG model demonstrated superior accuracy and real-time performance in experiments on a dataset containing six morphologically similar fish species, effectively meeting the requirements for real-time fine-grained fish recognition. Ref. [
13] introduced a lightweight real-time detection method, YOLOv5s-SGC, for deck crew and fishing net detection. This method, based on the YOLOv5s model, uses surveillance cameras to capture fishing vessel operation videos and enhances the dataset. YOLOv5s-SGC replaced the YOLOv5s backbone with ShuffleNetV2, substituted the feature fusion network with an improved generalized FPN (feature pyramid network), and added a CBAM (convolutional block attention module) attention module before the detection head. Ref. [
14] developed a real-time high-accuracy fish detection algorithm based on YOLOv5s. The model incorporated attention mechanisms and gated convolutions to enhance channel representations and spatial interactions, while introducing GhostNet to reduce the model weight size. The Light-YOLO model [
15] proposed a network optimized for complex underwater lighting and visual conditions by integrating an attention mechanism based on sparse connections and deformable convolutions. The study also introduced a lightweight network architecture containing deformable convolutions, called Deformable FishNet [
16]. This model incorporated an efficient global coordinate attention module (EGCA) and an EGCA-based deformable convolution network (EDCN/EC2f) to address fish body deformations caused by swimming movements. In [
17], a lightweight backbone network was generated using YOLOv4 to ensure fast computation, with an attention mechanism introduced to suppress irrelevant features. Finally, ref. [
18] proposed a novel lightweight module, the Sparse Ghost Module, designed specifically for accurate and real-time underwater organism detection. By replacing standard convolutions with the proposed convolutions, this approach significantly reduces the network complexity and enhances the inference speed, with minimal loss in detection accuracy.
Most existing studies focus either solely on improving accuracy [
19,
20,
21,
22] or solely on achieving lightweight design [
23,
24], while neglecting the trade-off between the two. This often results in models with improved accuracy but sharply increased complexity or lightweight models that sacrifice precision. Moreover, the datasets [
25,
26] used in these studies are often limited in scope, with simple categories and homogeneous environments. In contrast, our dataset is collected from complex underwater environments characterized by significant noise and diverse backgrounds. The four types of aquatic organisms considered in our study are particularly susceptible to interference from noise and background variations. Overly lightweight models may lack sufficient fitting capacity, leading to inadequate learning of discriminative features in such challenging conditions and ultimately resulting in reduced accuracy. In our study, we proposed a lightweight object detection model for aquatic organism recognition. Our model achieved 82.1% mAP, 416.7 FPS, with Params of only 2.04 M, FLOPs of just 5.8 G, and weights of only 4.13 MB. These characteristics make our model highly suitable for deployment on mobile devices. Our research advances the study of aquatic organism recognition models. The main contributions of our work are as follows:
(1) We propose a lightweight object detection model designed for aquatic organism recognition.
(2) We optimized the object detection model for lightweight performance and accuracy by using a Mobile-Nano backbone network, a lightweight detection head, the Dysample technique, and the HWD method.
(3) Experiments show that our model outperforms many widely used detection models, achieving a state-of-the-art balance between accuracy and efficiency.
2. Materials and Methods
2.1. Dataset Introduction
The dataset adopted in this paper was collected from the waters near Zhangzi Island in Dalian, Liaoning Province, China. It came from the URPC underwater object detection algorithm competition co-hosted by the National Natural Science Foundation of China, the Dalian Municipal Government, and Pengcheng Laboratory. As shown in
Table 1, this dataset is a multi-class dataset, consisting of four common types of aquatic fishery targets in total. Among them, the targets labeled as “echinus” have 17,735 ground-truth bounding boxes, which is the largest number and exceeds the sum of the other three types. The targets labeled as “holothurian” have 4488 ground-truth bounding boxes, those labeled as “scallop” have 5345 ground-truth bounding boxes, and the targets labeled as “starfish” have 5391 ground-truth bounding boxes.
We divided the dataset into a training set, a validation set, and a testing set. The training set is used for training to learn the features of the targets. The validation set is employed to quantify the learning ability during the training process and assist us in selecting the best model for testing. The testing set is utilized for the final testing to obtain the object detection evaluation metrics of the model and assess its generalization ability. The division situation of our dataset can be seen in
Table 2. We also detail the numerical composition of the training set: it contains 15,686 annotated bounding boxes for echinus, 3948 for holothurian, 4784 for scallop, and 4814 for starfish.
In terms of image counts, the training set contains 3880 underwater images, the validation set 554 images, and the testing set 1109 images.
2.2. YOLOv8 Object Detection Network
YOLOv8 is an efficient single-stage object detection model, which is divided into five versions, namely n, s, m, l, and x. The depth and width of each version increase successively, while their architectures remain the same. The architecture diagram can be seen in
Figure 1.
The backbone network of YOLOv8 utilizes Conv and C2f modules to extract lightweight, information-rich gradient-flow features at different scales. The Conv modules progressively downsample the feature maps, so the backbone yields features at large, medium, and small scales. The C2f module is an improvement on the C3 module of YOLOv5; it incorporates the advantages of the ELAN structure in YOLOv7 and optimizes the extraction of gradient-flow features. The most compressed feature maps are passed to the SPPF module, where pooling operations with different kernel sizes comprehensively capture and aggregate features. These feature maps are then sent to the subsequent head network for feature fusion and upsampling. For the detection head, a decoupled design is adopted, separating the classification branch from the regression branch, and the anchor-based design of YOLOv5 is replaced with an anchor-free design, which enhances the flexibility of the model. However, this also introduces a large number of parameters and considerable computational complexity.
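To make this structure concrete, the following is a minimal PyTorch sketch of the Conv, C2f, and SPPF blocks as described above. It is an illustrative re-implementation, not the official Ultralytics code; layer counts, channel choices, and hyperparameters are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv -> BatchNorm -> SiLU, the basic Conv block used throughout YOLOv8."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """Split the features, run one half through n bottlenecks, and concatenate
    every intermediate output to enrich the gradient flow (ELAN-style)."""
    def __init__(self, c_in, c_out, n=1, shortcut=True):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.m:
            y.append(m(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

class SPPF(nn.Module):
    """Three cascaded 5x5 max-poolings, concatenated to approximate pooling
    with several kernel sizes at a lower cost."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, c_hidden, 1)
        self.cv2 = ConvBNSiLU(4 * c_hidden, c_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```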
2.3. Mobile-YOLO Object Detection Network
In practical applications, especially in resource-constrained mobile devices, the lightweight nature of the model is of particular importance. A lightweight model can enhance the inference speed, reduce the dependence on hardware and the power consumption requirements, and also decrease the storage pressure, facilitating deployment and update. Based on this fundamental fact, we propose an object detection model suitable for mobile devices and name this lightweight, yet highly efficient and accurate, deep learning model Mobile-YOLO. Its detailed structure diagram can be seen in
Figure 2.
In our work, based on MobileNetv4 which is suitable for mobile platforms [
27], we have developed an even more lightweight backbone. This backbone is lighter than MNv4-Conv-S, the smallest version reported in the original paper, and performs remarkably well on our detection task. We name this backbone Mobile-Nano. Mobile-Nano consists of the Fused_IB module, the ExtraDW module, the IB module, and the SPPF module. Compared with the backbone of YOLOv8, it is lighter, extracts features more effectively, and has a simpler structure. In addition, after comparing the structures of YOLOv5 and YOLOv8, we found that replacing the coupled head with a decoupled head introduced a large number of parameters and considerable computational complexity. Accordingly, we simplified the decoupled head of YOLOv8 to further lighten the model. The head of YOLOv8 is also responsible for feature fusion; however, because detailed features are inevitably lost during ordinary downsampling and upsampling, the fusion effect is not satisfactory. Therefore, we introduced Dysample [
28], formulating upsampling from the perspective of point sampling, which is more resource-efficient and brings an impressive performance improvement. Moreover, we introduced the HWD [
29] module. This module applies the Haar wavelet transform to reduce the spatial resolution of the feature map while retaining as much information as possible. This special downsampling method further lightens the model while simultaneously improving its accuracy.
2.4. Mobile-Nano Backbone Network
The foundation of MobileNetv4 is constructed by the universal inverted bottleneck (UIB) block and other components. Here, we will focus on introducing the UIB module and the Fused_IB module, as these two modules are the core components of our Mobile-Nano. The detailed design can be seen in
Figure 3.
Firstly, the UIB module is a building block for efficient network design. It can adapt to various optimization objectives while remaining simple enough. The UIB extends the IB module proposed in MobileNetv2 [
30], which has become a standard building block for efficient networks and is widely used in various neural architectures. The UIB module has four possible structures, namely ExtraDW, IB, ConvNext, and FFN, which arise from different combinations of its two optional depthwise convolutions. The Fused_IB module is further used to improve efficiency: a k × k Fused_IB module is equivalent to an ordinary convolution with kernel size k followed by a pointwise convolution. Since even the lightest version of MobileNetv4, MNv4-Conv-S, is deeper than the original backbone network of YOLOv8, and its feature map transformations differ considerably from those of the original backbone, we aimed to modify the rest of the YOLOv8 structure as little as possible, as this reduces the amount of parameter selection and tuning required. Therefore, based on the core components of MobileNetv4, we created a network that is shallower and lighter than MNv4-Conv-S. The detailed design of our Mobile-Nano can be seen in
Table 3.
This backbone network utilizes the core components of MobileNetv4. Finally, the SPPF module from YOLOv8 applies pooling operations with different kernel sizes to comprehensively capture and aggregate features. The feature map transformation process of this backbone is the same as that of the YOLOv8 backbone, so the subsequent feature fusion structures do not require significant adjustments, which shortened our development process. Meanwhile, owing to the lightweight design and efficient feature extraction of Mobile-Nano, we obtained both a lighter model and higher accuracy.
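To illustrate how the four UIB variants arise from the two optional depthwise convolutions, the sketch below gives a simplified, assumed PyTorch version of the UIB block; the expansion ratio, kernel sizes, and activation are placeholders and may differ from the exact MobileNetv4/Mobile-Nano configuration.

```python
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1, groups=1, act=True):
    layers = [nn.Conv2d(c_in, c_out, k, s, k // 2, groups=groups, bias=False),
              nn.BatchNorm2d(c_out)]
    if act:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class UIB(nn.Module):
    """Universal inverted bottleneck with two optional depthwise convolutions.
    ExtraDW: both present; IB: middle only; ConvNext: start only; FFN: neither."""
    def __init__(self, c_in, c_out, expand=4, stride=1, start_dw_k=0, mid_dw_k=3):
        super().__init__()
        self.residual = (c_in == c_out and stride == 1)
        c_mid = c_in * expand
        layers, s = [], stride
        if start_dw_k:  # optional depthwise conv before the expansion
            layers.append(conv_bn_act(c_in, c_in, start_dw_k, s, groups=c_in, act=False))
            s = 1
        layers.append(conv_bn_act(c_in, c_mid, 1))                  # pointwise expansion
        if mid_dw_k:    # optional depthwise conv after the expansion
            layers.append(conv_bn_act(c_mid, c_mid, mid_dw_k, s, groups=c_mid))
            s = 1
        layers.append(conv_bn_act(c_mid, c_out, 1, s, act=False))   # pointwise projection
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.block(x) if self.residual else self.block(x)

# Illustrative variants:
# extra_dw = UIB(64, 64, start_dw_k=3, mid_dw_k=3)   # ExtraDW
# ib_block = UIB(64, 64, start_dw_k=0, mid_dw_k=3)   # IB
# A Fused_IB replaces the first two layers with a single k x k ordinary convolution
# followed by the pointwise projection.
```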
2.5. Lightweight Detection Head
Compared with YOLOv5, YOLOv8 replaces the coupled head with a decoupled head. This brings an improvement in accuracy but also leads to a sharp increase in network complexity and the number of parameters [
31,
32]. As shown in
Figure 4a, both the upper and lower branches (classification and regression) of the decoupled head in YOLOv8 are composed of three convolutional layers. However, the high-channel feature maps at the top of the network result in an extremely large number of parameters and high computational complexity. Moreover, we consider the multiple stacked convolutional layers to be largely redundant. Therefore, we retain only the last convolutional layer, thus obtaining a lightweight detection head, with only a slight decrease in accuracy. This lightweight detection head is referred to as LDtect in the following text. As shown in Table 7, the model equipped with the original, non-simplified detection head has 2.76 M parameters and 7.7 G FLOPs. After replacing it with LDtect, the parameter count is reduced to 2.13 M and the FLOPs decrease to 5.9 G, representing reductions of 22.83% and 28.57%, respectively. Meanwhile, the drop in accuracy remains within an acceptable range.
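The simplification can be illustrated with the following hedged PyTorch sketch, which contrasts one scale of the original stacked decoupled head with an LDtect-style head that keeps only the final prediction convolution of each branch; the channel counts and reg_max value follow common YOLOv8 conventions and are assumptions here.

```python
import torch.nn as nn

def conv_bn_silu(c, k=3):
    return nn.Sequential(nn.Conv2d(c, c, k, 1, k // 2, bias=False),
                         nn.BatchNorm2d(c), nn.SiLU())

class DecoupledHeadBranches(nn.Module):
    """YOLOv8-style head for one scale: two stacked 3x3 convs per branch
    before the final 1x1 prediction convolution."""
    def __init__(self, c_in, num_classes, reg_max=16):
        super().__init__()
        self.cls = nn.Sequential(conv_bn_silu(c_in), conv_bn_silu(c_in),
                                 nn.Conv2d(c_in, num_classes, 1))
        self.reg = nn.Sequential(conv_bn_silu(c_in), conv_bn_silu(c_in),
                                 nn.Conv2d(c_in, 4 * reg_max, 1))

    def forward(self, x):
        return self.cls(x), self.reg(x)

class LDtect(nn.Module):
    """Lightweight head: only the final prediction convolution of each branch is kept."""
    def __init__(self, c_in, num_classes, reg_max=16):
        super().__init__()
        self.cls = nn.Conv2d(c_in, num_classes, 1)
        self.reg = nn.Conv2d(c_in, 4 * reg_max, 1)

    def forward(self, x):
        return self.cls(x), self.reg(x)
```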
2.6. Dysample
The specific process of Dysample [
28] is shown in
Figure 5.
Given a feature map $\mathcal{X}$ of size $C \times H \times W$ ($C$, $H$, and $W$ represent the channel, height, and width of the feature map, respectively) and a sampling set $\mathcal{S}$ of size $2g \times sH \times sW$ ($s$ and $g$ are manually defined parameters), the grid sampling function uses the positions in $\mathcal{S}$ to resample $\mathcal{X}$, assuming bilinear interpolation, into a new feature map $\mathcal{X}'$ of size $C \times sH \times sW$. This process is defined as
$$\mathcal{X}' = \mathrm{grid\_sample}(\mathcal{X}, \mathcal{S}).$$
Given an upsampling factor $s$ and a feature map $\mathcal{X}$ of size $C \times H \times W$, a linear layer with input and output channel numbers $C$ and $2gs^{2}$, respectively, is used to generate an offset $\mathcal{O}$ of size $2gs^{2} \times H \times W$, which is then reshaped to $2g \times sH \times sW$ using pixel shuffling. The sampling set $\mathcal{S}$ is obtained by adding the offset $\mathcal{O}$ to the original sampling grid $\mathcal{G}$, as follows:
$$\mathcal{S} = \mathcal{G} + \mathcal{O}.$$
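The following is a minimal PyTorch sketch of this point-sampling upsampling process, written as our own illustrative implementation of the DySample formulation; the group number, offset scaling factor, and initialization details are assumptions and may differ from the configuration actually used in Mobile-YOLO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Upsample X (C x H x W) to C x sH x sW by resampling at learned positions.
    Channels must be divisible by the group number g."""
    def __init__(self, channels, scale=2, groups=4):
        super().__init__()
        self.scale, self.groups = scale, groups
        # Linear layer (1x1 conv) producing 2*g*s^2 offset channels.
        self.offset = nn.Conv2d(channels, 2 * groups * scale * scale, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        s, g = self.scale, self.groups
        # O: 2*g*s^2 x H x W -> pixel shuffle -> 2g x sH x sW
        # (the 0.25 factor limits the offset range, following the DySample paper).
        offset = F.pixel_shuffle(self.offset(x) * 0.25, s)
        # Original sampling grid G in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, s * h, device=x.device, dtype=x.dtype)
        xs = torch.linspace(-1, 1, s * w, device=x.device, dtype=x.dtype)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=0).unsqueeze(0)          # 1 x 2 x sH x sW
        # S = G + O, applied per group, then resample with bilinear interpolation.
        coords = (grid + offset.reshape(b * g, 2, s * h, s * w)).permute(0, 2, 3, 1)
        x = x.reshape(b * g, c // g, h, w)
        out = F.grid_sample(x, coords, mode="bilinear",
                            align_corners=True, padding_mode="border")
        return out.reshape(b, c, s * h, s * w)
```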
2.7. Haar Wavelet Downsampling
As shown in
Figure 6, the HWD module [
29] used in this paper consists of two blocks: (1) a lossless feature encoding block and (2) a feature representation learning block.
The lossless feature encoding block is responsible for transforming features and reducing spatial resolution. To achieve this, researchers leverage the Haar wavelet transform, a method that efficiently reduces the resolution of feature maps while retaining all information.
The representation learning block consists of a standard convolutional layer, batch normalization, and a ReLU activation layer. It is used to extract discriminative features. By using the wavelet downsampling module, we are able to preserve as much information as possible during downsampling, which enhances the feature fusion process and improves the accuracy.
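A hedged PyTorch sketch of this two-block structure is given below: the single-level Haar transform is implemented directly with 2 × 2 block sums and differences, while the kernel size of the representation-learning convolution is an assumption; the ReLU shown here is the activation variant compared in Section 3.3.

```python
import torch
import torch.nn as nn

def haar_dwt2(x):
    """Single-level 2D Haar transform via 2x2 block sums/differences.
    Returns the LL, LH, HL, HH sub-bands concatenated along the channel axis
    (4C x H/2 x W/2). Assumes even spatial dimensions."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (-a - b + c + d) / 2
    hl = (-a + b - c + d) / 2
    hh = (a - b - c + d) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)

class HWD(nn.Module):
    """Haar wavelet downsampling: lossless 2x spatial reduction followed by a
    Conv-BN-activation representation-learning block."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * c_in, c_out, kernel_size=1, bias=False),  # kernel size assumed
            nn.BatchNorm2d(c_out),
            nn.ReLU())

    def forward(self, x):
        return self.conv(haar_dwt2(x))
```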
2.8. Evaluation Metrics
In tasks related to object detection, a set of evaluation metrics is essential. This paper adopts metrics such as the precision, recall, F1-score, and mAP. Precision is defined by the following formula:
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
Here, $TP$ represents the number of correctly detected samples, while $FP$ represents the number of samples where the negative class is mistakenly identified as the positive class. Recall is defined by the following formula:
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
Here, $FN$ represents the number of samples where the positive class is erroneously identified as the negative class. Each class has its corresponding precision and recall values. Plotting these two metrics on the same graph yields a precision–recall (P-R) curve. The area under the P-R curve is termed AP, signifying the precision of the object detection algorithm. The mAP is then calculated by averaging the AP values for each class, as shown in the following formula:
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_{i}.$$
Here, $N$ represents the total number of classes. FPS (frames per second) is a crucial metric for measuring the speed of a model in computer vision tasks, indicating the number of image frames processed per second. In real-time systems, such as object detection or tracking in video streams, FPS serves as a key performance indicator. The calculation of FPS involves measuring the time it takes for the model to process each frame and then determining how many frames can be processed per second. The formula for FPS is as follows:
$$\mathrm{FPS} = \frac{1}{T_{\mathrm{total}}} = \frac{1}{t_{\mathrm{pre}} + t_{\mathrm{infer}} + t_{\mathrm{post}}}.$$
Here, $T_{\mathrm{total}}$ represents the total time required to process a single image, while $t_{\mathrm{pre}}$, $t_{\mathrm{infer}}$, and $t_{\mathrm{post}}$ denote the time spent on pre-processing, inference, and post-processing, respectively.
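For clarity, the small helper functions below show how these metrics follow from the counts and timings defined above; they are illustrative only and are not the evaluation code used in the experiments.

```python
def precision(tp, fp):
    # Precision = TP / (TP + FP)
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Recall = TP / (TP + FN)
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0

def mean_ap(ap_per_class):
    # mAP = (1/N) * sum of per-class AP values.
    return sum(ap_per_class) / len(ap_per_class)

def fps(t_pre, t_infer, t_post):
    # Per-image times in seconds; FPS = 1 / (t_pre + t_infer + t_post).
    return 1.0 / (t_pre + t_infer + t_post)

# Hypothetical example: 0.2 ms + 2.0 ms + 0.2 ms per image -> 1 / 0.0024 s ≈ 416.7 FPS.
print(fps(0.0002, 0.0020, 0.0002))
```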
3. Experiment and Results
3.1. Training Settings
In the domain of deep learning, the choice of hardware and software can substantially influence experimental outcomes, leading to variations in performance. To facilitate reproducibility and enable other researchers to replicate our experiments, we provide the configuration details utilized in our study, with the results summarized in
Table 4.
Furthermore, the model’s hyperparameters can also affect the experimental results to varying extents [
33,
34]. In our enhanced YOLOv8 model, we trained the network for 200 epochs using a batch size of 16. All input images were resized to 640 × 640 pixels before being fed into the model. The optimizer was set to “auto,” with a random seed of 0, and the close_mosaic parameter was configured to 10.
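For reference, these hyperparameters map onto the Ultralytics training interface roughly as follows; the model and dataset YAML file names are placeholders (assumptions), not the authors' actual configuration files.

```python
# The YAML file names below are placeholders, not the authors' actual configuration files.
from ultralytics import YOLO

model = YOLO("mobile-yolo.yaml")      # hypothetical config describing the modified architecture
model.train(
    data="urpc.yaml",                 # hypothetical dataset config (train/val/test splits)
    epochs=200,                       # training epochs
    batch=16,                         # batch size
    imgsz=640,                        # inputs resized to 640 x 640 pixels
    optimizer="auto",                 # optimizer selection left to the framework
    seed=0,                           # random seed
    close_mosaic=10,                  # disable mosaic augmentation for the final 10 epochs
)
```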
3.2. Comparison Experiments of Common Lightweight Methods
Currently, there is a large body of research on the lightweight optimization of object detection networks [
35,
36,
37,
38,
39]. Among the most representative approaches are replacing the existing backbone networks with MobileNetv3 and ShuffleNetv2. To highlight the superiority of the backbone network proposed in our Mobile-YOLO, we followed the open-source code provided in [
40] to replace the backbone network of YOLOv8, constructing YOLOv8-MobileNetv3. We then referred to the open-source code in [
41] to build the more lightweight YOLOv8-ShuffleNetv2. Finally, we compared these models with our Mobile-YOLO, and the results are presented in
Table 5.
It can be observed that YOLOv8-MobileNetv3 has the worst lightweight performance, with the lowest mAP and FPS, while its Params and FLOPs are the highest. YOLOv8-ShuffleNetv2 performs better in terms of lightweight optimization, showing excellent results in Params, FLOPs, and FPS. However, the original YOLOv8 already achieves a high mAP of 80.5%, and this lightweight approach causes a significant drop in accuracy. On the other hand, although our Mobile-YOLO does not perform the best in terms of Params and FLOPs, it achieves a high FPS and an improved mAP compared to YOLOv8. Our method thus demonstrates a balance between lightweight optimization and accuracy, highlighting the effectiveness of our approach.
In many existing studies, using MobileNetv3 for model lightweighting often yields excellent results. For instance, in [
42], the proposed model was trained and evaluated on a wheat FHB dataset, achieving a high mAP of 97.15% with only 3.64 million parameters and 4.77 billion FLOPs. Compared to the original YOLOv5, these values were reduced by 49.72% and 71.32%, respectively. However, the dataset used in this study was relatively simple, involving only binary classification. As a result, even with significant model lightweighting, it was still possible to retain relatively high accuracy in such straightforward tasks. In contrast, our dataset involves four categories, and the complex underwater environment introduces challenges such as light scattering and a cluttered background, with the targets often exhibiting clustering behaviors. When the number of target categories is too large or when the differences between categories are significant, the model may struggle to effectively learn the features of each category. Our dataset contains morphologically diverse underwater organisms, with targets varying in color, texture, or size. These diverse targets make it difficult for the model to extract features that can sufficiently distinguish between categories. Moreover, the dataset’s background is highly complex, containing many distracting elements or noise, making it harder for the model to extract relevant target information. This background complexity includes natural scenes, cluttered objects, lighting changes, or motion blur, all of which can interfere with target detection, especially when the target-background contrast is low or there is significant occlusion. These factors make our dataset particularly challenging, which is why using MobileNetv3 for lightweighting resulted in poor performance.
ShuffleNetv2 is also widely used in lightweight model research. A previous study [
43] reported that YOLO-WDNet reduced the number of parameters by 82.3%, the model size by 91.6%, and the FLOPs by 82.3%. The proposed method improved the detection mAP by 9.1%, reduced the inference time by 57.14%, and outperformed state-of-the-art algorithms in weed detection. However, similar to MobileNetv3, the dataset used in this study involved binary classification and had relatively low scene noise. One reason ShuffleNetv2 performed poorly on our dataset is the complexity of the data. Another reason is the model’s simplicity, which made it incapable of efficiently extracting and processing features from complex data. A model that is too simple often lacks the capacity to learn the intricate features of complex datasets. Object detection tasks involve extracting features from images and accurately locating and classifying targets, which require a high-performing model. A simple model may fail to capture high-level features such as complex backgrounds, fine details, or varying scales, especially when dealing with tasks involving significant changes in target size or complex image backgrounds. Consequently, a simple model may struggle with such tasks, leading to poor detection performance. Additionally, if the model architecture is too shallow, it may lose significant detail during feature extraction, particularly lower- and mid-level features in the image. Shallow networks often struggle to handle complex visual tasks effectively. The lightweight approaches based on ShuffleNetv2 typically result in shallow networks, which also contributed to the low accuracy observed in our results. In contrast, our Mobile-YOLO model strikes a better balance between accuracy and lightweighting. It achieves high precision while maintaining a lightweight architecture, making it an effective solution for aquatic organism recognition tasks in challenging environments.
3.3. Performance Experiments of Haar Wavelet Downsampling Activation Functions
To achieve better aquaculture monitoring performance while performing lightweight optimization on the model, we conducted a performance comparison of the wavelet downsampling module with various activation functions. The final model tested was our Mobile-YOLO, with the only difference being the activation function used. The test results are shown in
Table 6.
The experimental results indicate that the wavelet downsampling module with the ReLU activation function achieves the highest mAP50 and mAP50-95, as well as the highest FPS. The FPS is the highest for ReLU compared to other activation functions because ReLU is relatively simple. In contrast, the Tanh activation function, due to its complexity, results in the lowest FPS. The other activation functions show varying mAP50 and mAP50-95 values, with FPS remaining similar across them. Due to the outstanding performance of ReLU, we used the ReLU activation function in the wavelet downsampling module for the experiments described in the following sections.
3.4. Ablation Experiments
Table 7 presents the results of the ablation experiments, showing the impact of four components—Mobile-Nano, LDtect, Dysample, and HWD—on the lightweight performance of the model.
The experimental results indicate that the original YOLOv8 achieves the lowest mAP, suggesting that its feature extraction capability is inferior to that of our improved model. Additionally, the original YOLOv8 has the highest number of parameters, the highest complexity, the largest weight file, and the lowest FPS, indicating that it is the most complex model. After replacing the backbone network with Mobile-Nano, the model’s accuracy, i.e., the mAP, improved to 81.9%, while other lightweight metrics were also enhanced. Replacing the detection head with LDtect led to a slight decrease in mAP. This accuracy loss is the cost of reducing the three convolutions in the original detection head to one; however, the lightweight effect was significant, with Params reduced by 29.2%, FLOPs decreased by 27.2%, FPS increased by 51.9%, and Weights reduced by 28.0%. Continuing with the replacement of the feature fusion module’s upsampling block with Dysample, our model’s accuracy increased, but the lightweight performance slightly declined. Finally, replacing the downsampling block in the feature fusion module with HWD achieved the highest accuracy, with an mAP of 82.1%. At the same time, the model achieved the lightest performance, with Params reduced by 32.2%, FLOPs reduced by 28.4%, FPS increased by 95.2%, and Weights reduced by 30.8%.
Overall, our model strikes a balance between accuracy and lightweight performance, reaching an advanced level suitable for mobile devices for aquatic organism recognition.
3.5. Comparison Experiments with Popular Models
To further validate the performance advantages of our proposed model for aquatic organism recognition, we trained several popular neural networks [
44,
45,
46,
47,
48,
49,
50,
51,
52,
53,
54], with the performance of different models shown in
Table 8.
The results demonstrate that among these models, our Mobile-YOLO achieves the highest accuracy and FPS, while the Params, FLOPs, and Weights metrics are the second-best. The model with the smallest Params is YOLOv9t; however, its FPS is relatively low in this comparison. The model with the smallest FLOPs is EfficientDet, but its accuracy is quite low, with an mAP of only 56.42%. This is because our dataset contains a significant amount of underwater noise, and the EfficientDet model struggles to capture key features, mistaking noise for targets, leading to lower accuracy. YOLOv5n has the smallest Weights, and its other lightweight metrics also perform well, but its accuracy is not high enough.
Among these popular models, our Mobile-YOLO achieves high accuracy along with excellent lightweight metrics, highlighting its significant potential for aquatic organism recognition applications.
3.6. Model Detection Performance
The primary function of heatmaps is to provide interpretability for deep learning models, helping us understand the basis behind the model’s decision-making. In the visualized heatmaps, the intensity of the colors indicates which parts of the image contributed to the prediction of a specific classification label.
Figure 7 illustrates several heatmaps generated from our detection results.
In the first column, based on the ground-truth labels, it is clear that the YOLOv8 model failed to focus on two of the objects, as these objects are represented with a color intensity of zero in the heatmap. This indicates that the model did not successfully attend to these two objects, resulting in missed detections. This issue primarily arose because the objects blended into the background, making them hard to distinguish. Specifically, the model misclassified a starfish as a rock and an urchin as a shadowed area formed between rocks. In contrast, Mobile-YOLO successfully attended to all four objects in the image, differentiating the targets from the background and effectively avoiding missed detections. In the second column, the YOLOv8 model showed insufficient focus on the target, which could result in lower confidence scores since the model’s certainty about the target’s presence would be reduced. Furthermore, although the ground truth indicates only one target, the model generated additional attention outside of the target, suggesting a false positive. The lower color intensity in the heatmap is due to the similarity between the sea cucumber and the soil, making it difficult to distinguish. While the model did identify the target, the low certainty led to a lower confidence score. We believe that this false positive was caused by the model’s inadequate feature extraction capabilities and significant background interference, both contributing to the error. In contrast, the heatmap generated by our Mobile-YOLO model displayed deeper colors, predominantly red, indicating a much stronger focus on the target, which results in higher confidence in the detection outcome. Additionally, our model did not generate any extra attention outside of the ground-truth label, thereby avoiding false positives. This demonstrates the model’s superior feature extraction ability and its robust capability to distinguish between objects and the background. In the third column, neither our model nor the YOLOv8 model successfully attended to all the targets, and both incorrectly generated attention on the background, leading to varying degrees of false positives.
Figure 8 illustrates the detection results of different models.
From
Figure 8, it can be observed that in the first column, the YOLOv8 model exhibited a significant number of missed detections, while our model successfully detected all the targets. The number of missed detections in
Figure 8 is inconsistent with those in
Figure 7, where
Figure 7 shows more missed detections, though in reality, only one target was missed. This discrepancy might be due to less distinct color mapping, causing the target to blend into the background. However, the deeper color in our model’s heatmap indicates that it has a higher level of confidence for this target, as evidenced by the higher confidence scores in
Figure 8. In the second column, the YOLOv8 model produced a false detection, whereas our model avoided this issue. Furthermore, the deeper color of the target in
Figure 7 corresponds to the higher confidence score shown in the detection result in
Figure 8. In the third column, neither model successfully detected the target. We believe this is due to the presence of noise. In the future, we plan to investigate methods to denoise underwater images to ensure more accurate detection results. Finally, in
Figure 9, we present several detection results from our model.
In
Figure 10, we present several failure cases. In the first column of comparison images, our model successfully identified all true targets but mistakenly classified a portion of the background as potential objects. A similar issue occurred in the second column, where false positives and missed detections were observed. The third column also demonstrates a case of missed detection. We attribute these failures to the visual similarity between the background and the target objects. For instance, shadows or other background elements may share similar shapes and colors with Echinus, leading the model to make inevitable false or missed detections—errors that even experienced fishermen might encounter. In the missed detection shown in the second column, the target object exhibits relatively distinct features, which suggests that the feature extraction capability of our model still has room for improvement. In future work, we plan to explore more effective feature extraction architectures to reduce the occurrence of such failures.
5. Conclusions and Future Work
In response to the challenge of real-time aquatic organism recognition in underwater environments, this paper proposes a lightweight yet highly accurate Mobile-YOLO model. We introduced the lightweight Mobile-Nano backbone network, which is more efficient and capable of stronger feature extraction compared to the original YOLOv8 backbone. We simplified YOLOv8’s complex detection head by optimizing its redundant convolutional structures, resulting in a more lightweight detection head, LDtect. Additionally, we optimized the feature fusion structure of YOLOv8 by incorporating a more advanced upsampling strategy, Dysample, which designs upsampling from a point-sampling perspective, enhancing the resource efficiency and yielding impressive performance improvements. We also employed Haar wavelet downsampling (HWD) to reduce the spatial resolution of feature maps while preserving as much information as possible. Our model achieved an mAP of 82.1%, an FPS of 416.7, with only 2.04 M Params, 5.8 G FLOPs, and a weight size of 4.13 MB. These results indicate that our model is well-suited for mobile deployment and contributes significantly to the research on lightweight models for aquatic organism recognition. Meanwhile, the improvement in accuracy contributes to producing higher-quality results in tasks such as marine species monitoring, conservation efforts, and biodiversity assessment.
The method proposed in this study demonstrates not only strong performance metrics in experimental evaluations but also considerable practical value and engineering feasibility. In future work, this approach can be further extended to support a broader range of object categories and multimodal information fusion, thereby enhancing its environmental adaptability and detection robustness. With the continued advancement of embedded hardware and edge computing technologies, the proposed method holds great promise for deployment on intelligent underwater equipment, unmanned surface platforms, and remote environmental monitoring systems, enabling real-time and efficient automated perception tasks. Moreover, it offers wide application potential in areas such as marine ecological monitoring, intelligent aquaculture management, autonomous underwater robotic fishing, and aquatic species population studies, indicating a broad development prospect and strong potential for real-world adoption.
Our study still has certain limitations. First, the dataset we used contains only a little over 5000 images, and there is room for further expansion. Demonstrating the mAP improvement of our model on a larger dataset would provide more statistically significant results. Second, we lack practical deployment experiments. In future work, we will focus on addressing these two aspects. Finally, the mAP of our model was improved by 1.6%; however, due to time constraints, we were unable to conduct a statistical significance analysis. In future work, we plan to strengthen our investigation in this area.
We also plan to explore how to leverage large amounts of unannotated RGB images for depth estimation, as well as to develop data augmentation techniques that improve underwater image quality.