Article

DeepFishNET+: A Dual-Stream Deep Learning Framework for Robust Underwater Fish Detection and Classification

by Mahdi Hamzaoui 1,*, Mokhtar Rejili 2, Mohamed Ould-Elhassen Aoueileyine 1 and Ridha Bouallegue 1
1 Innov’COM Laboratory, Higher School of Communication of Tunis, University of Carthage, Technopark Elghazala, Raoued, Ariana 2083, Tunisia
2 Department of Biology, College of Sciences, Al Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11623, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(20), 10870; https://doi.org/10.3390/app152010870
Submission received: 24 August 2025 / Revised: 24 September 2025 / Accepted: 2 October 2025 / Published: 10 October 2025
(This article belongs to the Special Issue Advances in Aquatic Animal Nutrition and Aquaculture)

Abstract

The conservation and protection of fish species are crucial tasks for aquaculture and marine biology. Recognizing fish in underwater environments is highly challenging due to poor lighting and the visual similarity between fish and the background. Conventional recognition methods are extremely time-consuming and often yield unsatisfactory accuracy. This paper proposes a new method called DeepFishNET+. First, an Underwater Image Enhancement module was implemented for image correction. Second, a Global CNN Stream (ResNet50) and a Local Transformer Stream were implemented to generate the Feature Map and Feature Vector. Next, a feature fusion operation was performed in the Cross-Attention Feature Fusion module. Finally, YOLOv8 was used for fish detection and localization, and softmax was applied for species recognition. This new approach achieved a classification precision of 98.28% and a detection precision of 92.74%.

1. Introduction

Aquaculture is a major contributor to global food security, particularly in developing regions. Farmed fish production reached 82.1 million tonnes in 2018, representing 52% of the fish produced for human consumption, and rose to 94.4 million tonnes in 2022 [1,2]. In 2019, aquaculture generated USD 263.6 billion, representing 48% of global fish production [3]. With capture fisheries in decline, aquaculture provides a reliable source of high-quality proteins and a stable supply of aquatic products. Productivity depends on effective monitoring of underwater environments [4]. Conventional management requires direct farmer–fish interaction [5], which stresses fish, reduces growth, and increases disease risk. Automated recognition technologies enable continuous monitoring of fish health, growth, and behavior [6], allowing fish farmers to quickly identify problems related to fish health or overpopulation [7,8]. These systems also support sustainability by optimizing production and mitigating ecological impacts.
Underwater imagery and video are often degraded by low light, turbidity, and complex backgrounds, making manual processing costly and error-prone. Automatic processing of underwater video is an attractive alternative [9], but traditional methods are not robust to constantly changing lighting conditions. The rise of artificial intelligence, and machine learning in particular, offers better solutions: the combination of deep learning and computer vision now provides accurate tools for fish detection, disease diagnosis, and weight prediction [10,11].
In order to overcome these limitations, we present a novel architecture called DeepFishNET+, integrating (i) an enhancement module for color and brightness corrections; (ii) a dual-stream module combining a global CNN context and a local Transformer for fine details; (iii) cross-attention fusion for robust feature representation; (iv) YOLOv8 localization with softmax classification for fish species identification.
A review of previous studies and their results is presented in Section 2. Section 3 describes the dataset, the image-cleaning and pre-processing steps, and the architecture of the proposed method. Section 4 presents the experimental results and evaluates the robustness of the new DeepFishNET+ model against previous work. Section 5 discusses these results and compares them with those reported in similar studies. Finally, the last section concludes the study and outlines prospects for future improvements.

2. Related Work

Fish detection and classification are crucial for aquaculture productivity. Several studies have addressed underwater image and video enhancement to improve model robustness. Liu et al. proposed a plug-and-play module for detector models such as YOLOv5, trained without additional data, which improved detection by up to +3.3 Average Precision on the URPC2021 dataset [12]. Prabhavathy et al. combined attention modules (CBAMs), Swin Transformers, and scattering models to refine underwater image quality, achieving 81.4% detection accuracy on the TrashCan dataset [13]. Guan et al. introduced AUIE-GAN, an Adaptive Underwater Image Enhancement model based on Generative Adversarial Networks, with the main objective of improving images captured in marine environments. This adaptive GAN model was trained on the UIEB and SUID datasets, outperforming methods such as FUnIE-GAN (Fast Underwater Image Enhancement GAN) and PUIE-Net (Perceptual Underwater Image Enhancement Network) in terms of visual quality and objective metrics [14]. Similarly, Cong et al. proposed a model that merges a physical parameter prediction subnetwork with a dual-discriminator GAN, optimizing visual fidelity and enhancement performance [15]. A lightweight network based on efficient convolutions and channel merging mechanisms, named DGC-UWnet, was proposed by Hao et al.; the aim was to improve the quality of underwater images while adapting to resource-constrained environments. In the test phase, DGC-UWnet was used as a data pre-processing method, and it significantly boosted YOLOv5 detection accuracy [16]. Chen et al. proposed a Collaborative Compensative Transformer Network for salient object detection, incorporating collaborative relation and compensative fusion modules to improve segmentation performance in complex environments where the object is very similar to the background [17]. Meng et al. introduced an RGB-D salient object detection model that integrates cross-attention and multi-level contextual interaction mechanisms fused with boundary guidance to refine contour accuracy; this method is particularly effective in low-visibility environments [18].
Traditional object detectors such as Faster R-CNN, YOLO, and SSD have been widely applied but often suffer from accuracy loss due to poor visibility and domain gaps between real and synthetic data. Domain adaptation and synthetic data generation have been proposed as remedies. For example, the ADOD (Adaptive Domain-Aware Object Detector) approach integrates residual attention modules and a domain classifier into YOLOv3, reaching a mean Average Precision (mAP) of 54.09% on the URPC2019 dataset [19]. EnYOLO merges image enhancement and object detection within a unified framework, facilitating deployment on low-resource autonomous underwater robots [20]. MARS (Multi-Scale Adaptive Robotics Vision) is a model proposed by Saad et al. that exploits multi-scale attention mechanisms and class-based domain adaptation, improving detection accuracy rates in different marine contexts [21]. Another method, GCC-Net (Gated Cross-domain Collaborative Network), adopts cross-domain collaboration by fusing original and enhanced features, achieving promising results on four underwater datasets [22]. A data-centric framework proposed by Folkman et al. evaluates several image quality enhancement techniques to minimize domain gaps, leading to gains of up to +8% in mAP over different aquatic environments [23]. Furthermore, the use of synthetically generated data in the Syn2Real (Synthetic to Real) approach, with the aid of diffusion models, has optimized the generalization of the Mask R-CNN, with an improvement of over 60% in Average Precision compared with training solely on real data [24]. Other lightweight solutions such as SWIPENET present a weighted learning strategy derived from curriculum learning, with significant results even in noisy conditions [25].
Transformer-based methods, such as the Swin Transformer and ViT, have recently demonstrated robustness in capturing global dependencies, though their application in underwater fish detection remains limited. To overcome this shortcoming, several works have focused on hybrid architectures merging the power of CNNs for local feature extraction with Transformers for global context modeling. The HTDet model proposed by Chen et al. integrates MobileViT and MobileNetV2 to detect small, hard-to-localize objects in underwater images while maintaining a low computational footprint [26]. DyFish-DETR is an architecture based on DETR (DEtection TRansformer) designed specifically for the detection of moving fish, incorporating a lightweight hybrid encoder that minimizes computational costs while improving accuracy [27]. In addition, another method based on the Swin Transformer was presented by Pavithra et al. for image segmentation and underwater object detection, exploiting the model’s ability to handle long-range relationships while remaining fast enough for real-time use [28]. FLSSNet (Forward-Looking Sonar Salient object detection Network) combined CNN and Transformer modules for forward-looking sonar image segmentation, demonstrating enhanced robustness in complex, low-visibility environments [29].

3. Materials and Methods

In this section, we begin by presenting the dataset and the processing steps involved, in particular the cleaning and bounding box generation operations. We then explain in detail the architecture of our DeepFishNET+ approach.

3.1. Dataset

The images were captured in the marine fishing port of Cabuyao City, Laguna, Philippines. The dataset, entitled the YOLO-ViT Dataset, consists of 1956 images of fish collected under a variety of environmental conditions and distributed over 10 fish species. Table 1 shows the species found and their distribution in the dataset. To improve data quality and make the model training phase more relevant, data pre-processing and analysis were carried out. Rotation simulates variations in real-world viewing angles, enabling the model to better adapt to real-world images; a rotation between −15° and +15° was applied to make the model more robust. Underwater images often suffer from light attenuation and uneven brightness, which reduce contrast, so a grayscale transformation with a threshold of 0.5 was applied to the original images to improve contrast. These data augmentation techniques were applied to increase the diversity of the dataset, improve model generalization, and prevent overfitting. After augmentation, the dataset contained a total of 4332 images, with 3789 images used for training, 287 for validation, and 256 for testing model robustness.
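For illustration, the snippet below shows one way such an augmentation pipeline could be set up with torchvision; interpreting the 0.5 grayscale threshold as an application probability is an assumption, and the exact implementation used for the YOLO-ViT dataset may differ.

```python
import torchvision.transforms as T

# Sketch of the described augmentations (the 0.5 grayscale "threshold" is
# interpreted here as an application probability, which is an assumption).
augment = T.Compose([
    T.RandomRotation(degrees=15),   # random rotation in [-15°, +15°] to vary viewing angle
    T.RandomGrayscale(p=0.5),       # grayscale variant to counter low underwater contrast
    T.ToTensor(),                   # PIL image -> float tensor in [0, 1]
])
```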

3.2. Proposed Method

The approach proposed in this work is hybrid, as illustrated in Figure 1. It consists of four main phases. The first step improves the quality of underwater images through an Underwater Image Enhancement Network. In the second step, two streams, the Global CNN Stream and the Local Transformer Stream, run in parallel and focus on the most relevant regions of the image. The third step merges the features produced by the two streams. Finally, a Multi-Task Head module with two heads is applied: the first detects and localizes fish objects, while the second uses softmax for species classification and recognition.
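As a purely illustrative sketch, the skeleton below shows how these four phases could be composed in code; every sub-module is a placeholder (simple convolutions and identities) standing in for the components detailed in the following subsections, and the class name DeepFishNetPlus and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class DeepFishNetPlus(nn.Module):
    """Skeleton of the four-phase pipeline; each block is a stand-in for the
    real component described in Sections 3.2.1-3.2.4."""
    def __init__(self, num_species=10):
        super().__init__()
        self.uie_net = nn.Identity()                     # 1) underwater image enhancement
        self.global_stream = nn.Conv2d(3, 64, 3, 2, 1)   # 2a) global CNN stream (ResNet50 in the paper)
        self.local_stream = nn.Conv2d(3, 64, 3, 2, 1)    # 2b) local Transformer stream (Swin-Tiny in the paper)
        self.fusion = nn.Conv2d(128, 64, 1)              # 3) cross-attention feature fusion (placeholder)
        self.det_head = nn.Conv2d(64, 5, 1)              # 4a) YOLOv8-style head: 4 box coords + confidence
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(64, num_species))  # 4b) species classification

    def forward(self, x):
        x = self.uie_net(x)
        fused = self.fusion(torch.cat([self.global_stream(x), self.local_stream(x)], dim=1))
        return self.det_head(fused), self.cls_head(fused).softmax(dim=-1)

detections, species_probs = DeepFishNetPlus()(torch.randn(1, 3, 224, 224))
```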

3.2.1. Underwater Image Enhancement Network (UIE-Net)

Given the problems associated with underwater environmental conditions, improving image quality is an essential phase in our work. As a first step, we implemented a module called UIE-Net, dedicated to remedying these problems. UIE-Net is based on a lightweight Generative Adversarial Network (GAN), designed to correct degradations caused by the aquatic environment. It performs spectral restoration to recover color losses, improves contrast for better distinction between objects of interest, and minimizes visual noise to stabilize the performance of downstream processing models. The main aim of this pre-processing is to transform the original RGB images, often altered by turbidity and light attenuation, into images closer to the visual quality obtained in terrestrial environments. We opted for a lightweight GAN to guarantee efficient enhancement while keeping computational complexity low. To obtain an objective assessment of the change in image quality, we used four indicators, PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), UIQM (Underwater Image Quality Measure), and UCIQE (Underwater Color Image Quality Evaluation):
  • PSNR (Peak Signal-to-Noise Ratio): This evaluates the fidelity of the restored image compared to the original reference. A higher value indicates better reconstruction quality.
  • SSIM (Structural Similarity Index): This measures the structural similarity between two images, taking into account contrast, brightness, and texture. The closer the value is to 1, the better the perceived quality.
  • UIQM (Underwater Image Quality Measure): This is a reference-free indicator, suitable for underwater images, which combines sharpness, contrast, and natural colors.
  • UCIQE (Underwater Color Image Quality Evaluation): This is an indicator designed for underwater environments, without reference, and based on the dispersion of chromaticity, contrast, and color saturation.
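The two reference-based indicators can be computed with scikit-image, as in the sketch below; UIQM and UCIQE have no standard library implementation and would require custom code, so they are omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def enhancement_scores(original, enhanced):
    """Full-reference quality scores between a raw frame and its enhanced version
    (both expected as uint8 RGB arrays of the same shape)."""
    psnr = peak_signal_noise_ratio(original, enhanced, data_range=255)
    ssim = structural_similarity(original, enhanced, channel_axis=-1, data_range=255)
    return psnr, ssim

# toy usage with synthetic images; in practice, load the raw image and the UIE-Net output
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, (240, 320, 3), dtype=np.uint8)
enhanced = np.clip(raw.astype(int) + rng.integers(-10, 11, raw.shape), 0, 255).astype(np.uint8)
print(enhancement_scores(raw, enhanced))
```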

3.2.2. Dual-Stream Feature Extractor

In the second phase, called the Dual-Stream Feature Extractor, we implemented two streams running in parallel. The first is the Global CNN Stream, based on a ResNet50 backbone. It is used specifically to capture the fishes’ overall shape, silhouette, and dominant features. It takes as input the image generated by UIE-Net and produces a corresponding feature map.
The second stream, called the Local Transformer Stream, uses the same image generated by UIE-Net. To capture spatial relationships between different areas of the image, even in the presence of noise, we used a Swin Transformer Tiny backbone. These fine details and spatial relationships are encoded in a feature vector.
After the parallel execution of the Global CNN Stream and the Local Transformer Stream, an intermediate integration step is necessary to prepare the features for the final merge. First, BatchNorm normalization is applied to adjust the outputs of each stream. Next, a linear projection harmonizes the dimensions of the extracted tensors. Where spatial resolutions differ, resizing or interpolation is applied to align the feature maps on the same spatial grid. This phase ensures that the outputs of both streams have compatible, homogeneous shapes, ready for optimal merging.
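A minimal PyTorch sketch of this dual-stream extraction and alignment step is given below; the torchvision backbones, the common embedding dimension of 256, and the projection layout are assumptions, since the paper does not fix these implementation details.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, swin_t

class DualStreamExtractor(nn.Module):
    """Global CNN stream (ResNet50) + local Transformer stream (Swin-Tiny),
    each normalized and projected to an assumed common dimension."""
    def __init__(self, dim=256):
        super().__init__()
        cnn = resnet50(weights="DEFAULT")
        self.global_stream = nn.Sequential(*list(cnn.children())[:-2])  # -> (B, 2048, H/32, W/32)
        self.local_stream = swin_t(weights="DEFAULT").features          # -> (B, H/32, W/32, 768), channels-last
        self.proj_g = nn.Sequential(nn.Conv2d(2048, dim, 1), nn.BatchNorm2d(dim), nn.ReLU())
        self.proj_l = nn.Sequential(nn.Conv2d(768, dim, 1), nn.BatchNorm2d(dim), nn.ReLU())

    def forward(self, x):
        fmap = self.proj_g(self.global_stream(x))               # global feature map
        loc = self.local_stream(x).permute(0, 3, 1, 2)          # NHWC -> NCHW
        loc = self.proj_l(loc)
        # align spatial resolutions before fusion
        loc = nn.functional.interpolate(loc, size=fmap.shape[-2:], mode="bilinear", align_corners=False)
        return fmap, loc
```

In this sketch both projections use a 1 × 1 convolution as the linear projection; any equivalent linear layer over flattened tokens would serve the same purpose.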

3.2.3. Cross-Attention Feature Fusion

Cross-Attention Feature Fusion is an essential step in the integration of information generated by the Global CNN and Local Transformer Stream. This module utilizes a cross-attention mechanism to cross-weight features extracted from the first stream, which focuses on global shapes and structures, and the second stream, which captures fine details and spatial relationships. This careful fusion enables the network to highlight discriminating areas of the image, such as fish fins, heads, and tails.
Initially, as shown in Figure 2, the Feature Map is processed by a 1 × 1 convolution, followed by Batch Normalization and ReLU activation, to generate the query vector (Q). In parallel, the Spatial Features are divided into two branches: the first applies a 1 × 1 convolution, the second a 3 × 3 convolution, each followed by normalization and ReLU activation. The two outputs are then concatenated to form the key (K) and value (V) vectors. The scalar product between Q and the transpose of K is then scaled by the square root of the key dimension and passed through the softmax function, generating an attention map that weights the importance of local regions from V and produces an Attention Output. To enhance the quality of relevant regions and attenuate noisy areas, a denoising block was added, consisting of a 1 × 1 convolution, BatchNorm, and ReLU. The end result of this weighted fusion is a Feature Fusion Map, rich in global and local information, ready to be exploited by the network’s multi-task output heads.
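The sketch below illustrates this fusion mechanism in PyTorch, following the description above (1 × 1 query projection, parallel 1 × 1 and 3 × 3 branches concatenated into keys and values, scaled dot-product attention, and a final 1 × 1 denoising block); the channel count and exact layer ordering are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Cross-attention between the CNN feature map (query) and the
    Transformer spatial features (keys/values), with a denoising block."""
    def __init__(self, dim=256):
        super().__init__()
        self.to_q = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU())
        self.branch1 = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU())
        self.branch3 = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU())
        self.to_kv = nn.Conv2d(2 * dim, 2 * dim, 1)   # split into K and V
        self.denoise = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU())

    def forward(self, feat_map, spatial_feat):
        b, c, h, w = feat_map.shape
        q = self.to_q(feat_map).flatten(2).transpose(1, 2)                 # (B, HW, C)
        kv = self.to_kv(torch.cat([self.branch1(spatial_feat),
                                   self.branch3(spatial_feat)], dim=1))
        k, v = kv.chunk(2, dim=1)
        k = k.flatten(2).transpose(1, 2)                                   # (B, HW, C)
        v = v.flatten(2).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)     # attention map
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)               # attention output
        return self.denoise(out)                                           # feature fusion map

fused = CrossAttentionFusion()(torch.randn(1, 256, 7, 7), torch.randn(1, 256, 7, 7))
```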

3.2.4. Multi-Task Head

The final phase of our method is the Multi-Task Head. It aims to simultaneously perform the tasks of detecting fish in the image and classifying them by species. This module is based on a multi-branch architecture exploiting the merged deep features produced by the previous phases. As illustrated in Figure 3, the first branch is a detection head based on the YOLOv8 architecture. This head uses an anchor-free technique to predict, for each spatial cell, the coordinates of the bounding boxes and the confidence score. The second branch is a classification head, which recognizes the species of the detected fish. It uses a Global Average Pooling operation, followed by a fully connected layer and softmax activation, to generate a class prediction among the known species. The model is trained in a multi-task framework, minimizing a composite cost function: the weighted sum of the detection and classification losses. This strategy enables the architecture to jointly optimize its fish localization and species recognition capabilities, while taking advantage of the cross-representations learned in the previous modules.
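Such a composite objective can be expressed as a weighted sum, as in the sketch below; the weighting coefficients are placeholders, since the paper does not report the values used.

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Weighted sum of detection and classification losses (weights are assumed)."""
    def __init__(self, lambda_det=1.0, lambda_cls=0.5):
        super().__init__()
        self.lambda_det, self.lambda_cls = lambda_det, lambda_cls
        self.cls_criterion = nn.CrossEntropyLoss()

    def forward(self, det_loss, cls_logits, cls_targets):
        # det_loss: scalar already produced by the YOLOv8-style detection head
        return self.lambda_det * det_loss + self.lambda_cls * self.cls_criterion(cls_logits, cls_targets)

# toy usage with a dummy detection loss and random logits for 10 species
loss = MultiTaskLoss()(torch.tensor(0.42), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```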

3.3. Evaluation Metrics

In order to measure the performance of our model, we used the metrics Accuracy, Precision, Recall, and F1-score in the classification task. Precision, mAP50, and box-loss are used to measure the model’s capability in the task of detecting and locating objects. These metrics are based on the outcomes of the classification, specifically the counts of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs).
  • TPs (True Positives): The number of positive instances correctly classified as positive.
  • FPs (False Positives): The number of negative instances incorrectly classified as positive.
  • TNs (True Negatives): The number of negative instances correctly classified as negative.
  • FNs (False Negatives): The number of positive instances incorrectly classified as negative.
Accuracy (1) measures the overall correctness of the model by calculating the proportion of correctly classified instances, both positive and negative, out of the total number of instances. Precision (2), also known as Positive Predictive Value, measures the accuracy of the positive predictions: it is the proportion of true positive predictions out of all positive predictions, both true and false. Recall (3) measures the ability of the model to correctly identify all positive instances: it is the proportion of true positive predictions out of all actual positives. The F1-Score (4) is the harmonic mean of Precision and Recall; it balances the two metrics and is useful when both false positives and false negatives matter.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad (1)$$
$$\text{Precision} = \frac{TP}{TP + FP}, \quad (2)$$
$$\text{Recall} = \frac{TP}{TP + FN}, \quad (3)$$
$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \quad (4)$$
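For reference, these metrics can be computed directly with scikit-learn; the labels below are toy values, and macro averaging is one reasonable choice for a multi-class problem such as the 10 species considered here.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# toy multi-class labels (illustrative only)
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-score :", f1_score(y_true, y_pred, average="macro"))
```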

3.4. Experimental Setup

All of our experiments were conducted under the same conditions on a remote server equipped with an Intel i7-12900K CPU and 18 GB of RAM, running Windows 11. The deep learning models were implemented in Python 3.10 using PyTorch 2.1 and executed on an NVIDIA RTX 3080 GPU with CUDA 12.1 support. Our local machine was connected to this server, and all experiments were performed using the same software and hardware environment to ensure consistency and reproducibility of the results. After augmentation, the dataset contained 4332 images of fish. The proposed model was trained over 50 epochs, using AdamW as the optimizer, chosen for its better generalization ability compared to the classic Adam optimizer. Rather than starting directly with a high value, the learning rate was adjusted gradually: a warm-up phase was conducted during the first 10 epochs, during which the learning rate was linearly increased from 0 to 0.001, and it was then reduced according to a cosine decay to a final value of 1 × 10⁻⁶ over the remaining epochs. Since large batch sizes produce more stable gradient estimates, which translate into more stable convergence, we opted for a batch size of 256. Figure 4A,B show the contents of batch 0 and batch 12, respectively, at training time. These contents represent how the model sees the images and their annotations during the learning phase. Table 2 shows the detailed configuration of the model.
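The warm-up and cosine-decay schedule described above can be reproduced with a LambdaLR scheduler, as sketched below; the placeholder model and the exact shape of the decay curve are assumptions consistent with the stated settings (10 warm-up epochs, peak learning rate 0.001, floor 1 × 10⁻⁶, 50 epochs in total).

```python
import math
import torch

model = torch.nn.Linear(10, 10)  # placeholder module standing in for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_epochs, total_epochs, lr_min, lr_peak = 10, 50, 1e-6, 1e-3

def lr_factor(epoch):
    if epoch < warmup_epochs:                       # linear warm-up: 0 -> lr_peak
        return epoch / warmup_epochs
    # cosine decay from lr_peak down to lr_min over the remaining epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return lr_min / lr_peak + (1 - lr_min / lr_peak) * 0.5 * (1 + math.cos(math.pi * t))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(total_epochs):
    # ... one training pass over the batches would go here ...
    scheduler.step()
```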

4. Results

4.1. Validation of Results Obtained by UIE-Net

In the experimental phase, we tested the UIE-Net module separately. Figure 5 shows an example of an original image from the dataset, depicting a Big Head Carp, together with the enhanced version of the image after applying UIE-Net. To obtain a quantitative assessment of the improvements made, we measured the PSNR, SSIM, UIQM, and UCIQE values, all of which are reported in Table 3. For the PSNR metric, for example, the value obtained for the original image is 18.2 decibels (dB), which improves to 25.6 dB after enhancement. Similarly, the SSIM value increases from 0.58 to 0.79, closer to 1.

4.2. Validation of Results Obtained by Dual-Stream Feature Extractor Module

As illustrated in Figure 6, the first row (A) shows a visualization of the heatmap generated by the ResNet50 model, the first stream of our Dual-Stream Feature Extractor module. The result shows that the model’s attention is focused on the entire object in the image: specific attention is given to the overall shape of the fish and its silhouette, while fine areas such as the eyes and fins are not emphasized. The second stream, the Local Transformer Stream, is shown in Figure 6B. As this stream is based on understanding the spatial relationships between different regions of the image, specific attention is paid to fine details; the heatmap visualization shows a striking focus on the eyes, head, tail, and fins. Figure 6C shows the result obtained after an intermediate fusion of the two streams. The heatmap is larger, showing greater coverage of the entire fish. These three visualizations confirm the complementarity described in Section 3.2.2.

4.3. Method Comparison

In this section, we evaluate the effectiveness of our new method, DeepFishNET+. In order to validate the results obtained, we opted for a comparison with other existing methods such as YOLO and ViT. The experimental results showed that our method outperformed these competing approaches, achieving higher scores on all evaluation metrics.
In the classification phase, DeepFishNET+ was compared with three reference architectures: ViT-B/16, ResNet50, and Swin Transformer. As shown in Table 4, our method achieved superior performance, scoring 98.43%, 98.28%, 98.21%, and 98.21% for Accuracy, Precision, Recall, and F1-Score, respectively. By comparison, ResNet50 achieved 91.16%, 93.97%, 92.09%, and 91.01%, respectively, while ViT-B/16 scored 93.65%, 94.25%, 93.72%, and 93.32%, and the Swin Transformer achieved 96.86%, 96.92%, 96.68%, and 96.51%. The results of our model were also compared with other recognized architectures on the detection task. As shown in Table 5, the scores obtained with DeepFishNET+ outperformed both YOLOv7 and YOLOv8. Using the mAP@0.5 metric, for example, the results obtained were 90%, 90.82%, and 97.1% for YOLOv7, YOLOv8, and DeepFishNET+, respectively.
Figure 7 contains two curves illustrating the precision–confidence relationship for the YOLOv8 and DeepFishNET+ models. For YOLOv8, the Big Head Carp is the most difficult species to identify, while the Scat Fish is the easiest. With a confidence between 0.4 and 0.6, YOLOv8 achieves a precision close to 80%. For DeepFishNET+, recognition of the Climbing Perch is the most difficult task; this improved model achieves 90% precision with confidence thresholds between 0.4 and 0.6.
The evolution of the DeepFishNET+ model’s learning phase is shown in Figure 8. Over the course of the training cycles, these curves illustrate significant progress. At the beginning, the loss rates were high, which shows the difficulties faced by the model in properly assimilating the training data. However, as the model went through more iterations, the curves showed a clear downward trend, indicating a continuous improvement in performance. This downward convergence of the curves reflects the model’s gradual adaptation to the complexity of the training data. The decrease in error rates also demonstrates the model’s ability to generalize effectively, thereby avoiding overfitting. In summary, this graphical representation provides a dynamic view of the model’s learning process, highlighting its positive evolution as it is exposed to the training set.

4.4. Results Obtained by DeepFishNET+ Method

Figure 9 illustrates the classification results generated by different standard models without any improvements and by our proposed DeepFishNET+ method. The selected images present problems related to the similarity between the fish and the background and a lack of brightness. Image 1 shows a Big Head Carp, which is correctly recognized by the ViT-B/16 and DeepFishNET+ models, but is classified as Tilapia by ResNet50. Image 2 shows a Climbing Perch, with a high degree of similarity between the fish’s skin and the background; only DeepFishNET+ made a correct classification. In image 3, there are two Jaguar Gapote fish. The ResNet50 model classified them as Scat Fish, while both ViT-B/16 and DeepFishNET+ produced correct classifications. Finally, image 4 shows a Bangus. Both the ViT-B/16 and DeepFishNET+ models correctly identified the fish type, while the ResNet50 model made an incorrect classification. Figure 10 shows the details of object localization using the YOLOv7, YOLOv8, and DeepFishNET+ models. In all images, the bounding boxes produced by DeepFishNET+ accurately delineate the fish objects, whereas the other two models, YOLOv7 and YOLOv8, showed localization issues due to complex marine conditions.

4.5. Model Validation on Other Datasets

In order to evaluate our DeepFishNET+ model in greater depth, we trained, validated, and tested it on four datasets other than the main dataset used in our work. These datasets all contain images captured in difficult environmental conditions, such as in darkness, with poor water quality, at low resolution, and with similarity between the fish skin and the background. This choice of datasets was motivated by the challenge of locating and recognizing fish underwater. The goal is to create a robust model that can be generalized even in complex, difficult, and varied situations. The results obtained by DeepFishNET+ are encouraging during the testing phase, as shown in Table 6.
In the classification phase, our model achieved a precision of 99.72% on the Fish-gres dataset and 99.12% on the Fish4Knowledge dataset. On the large-scale dataset, DeepFishNET+ achieved a precision of 96.86%, and on the Fish-Pak dataset it reached 98.26%. In the fish detection and localization phase, the precision of our model was 93.01%, 92.93%, 90.72%, and 91.48% for the Fish-gres, Fish4Knowledge, large-scale, and Fish-Pak datasets, respectively.

5. Discussion

In order to validate the results obtained by our DeepFishNET+ method, we opted to compare it with existing methods whose main task is to identify fish and recognize their types. Table 7 presents the approaches chosen, the datasets used in the experimental phases, and the results obtained. For example, a study by [34] presents an improved YOLOv5 model for fish recognition and localization. This model, called FishDETECT, uses a pre-trained model called FishMask in the context of transfer learning to better handle complex scenes and poor environmental conditions. The experimental results obtained indicate an accuracy rate of 96.2%, highlighting the effectiveness of the proposed model in various environments. However, despite these promising results, the approach remains primarily oriented towards object detection, and not specifically towards fine classification between similar species, which is essential in an aquaculture environment where slight morphological variations must be taken into account. In [35], the authors propose a fusion of the Swin Transformer and the FGVC-PIM module to improve the fine classification of fish species, while focusing on the most relevant regions. Tested on 14 datasets, this approach achieves satisfactory results, with accuracies above 83% in all cases. However, this approach is architecturally complex, highly dependent on fine-tuning parameters, and sensitive to similar classes and images containing multiple subjects. Furthermore, its generalization to real underwater contexts remains imperfect. In another study [36], the authors present a fish detection algorithm called CUIB-YOLO, based on the YOLOv8 architecture. The main objective is to meet the requirements related to computational complexity and limited hardware resources often encountered in the aquaculture sector. To achieve this, they introduce an innovative C2f-UIB module that replaces the classic C2f module at the neck of the network. In addition, they integrate the EMA mechanism to refine feature fusion. Thanks to these improvements, the model’s parameters are reduced to 2.5 M and its FLOPs to 7.5 G, representing a decrease of 15.7% and 7.5% respectively compared to YOLOv8n. Despite this reduction, performance remains stable with a mAP@0.5-0.95 of 76.4%, very close to the original model. The results show that this lighter model improves inference speed and real-time performance while maintaining detection accuracy comparable to existing approaches. This contribution takes into account the importance of lightweight models for embedded applications in aquaculture, where resources are limited. However, the study focuses primarily on comparison with YOLOv8n, without an in-depth evaluation of other competing lightweight architectures such as MobileNet, for example. Furthermore, the generalization of the model to different aquaculture contexts with different species, and in the presence of problems related to variable environmental conditions and underwater visual noise, is not clearly explored. A migration to more complex scenarios, integrating both multi-species detection and recognition, would represent a significant evolution. Our DeepFishNET+ method stands out not only for its superior efficiency, but also for its generality and performance in the face of environmental challenges often encountered in real aquatic environments.
Unlike many existing studies, the DeepFishNET+ model was tested on low-resolution images with reduced brightness, characterized by a strong similarity between the texture of the fishes’ skin and the background. These unfavorable conditions reflect real-life situations encountered in aquaculture environments, particularly in murky waters or conditions with low visibility. The results show that the CNN–Transformer combination improves the robustness of our method in the face of these challenges.
On a technical level, although the DeepFishNET+ model incorporates advanced modules, its inference time remains compatible with near-real-time use thanks to the optimization of the YOLOv8-based pipeline. Memory usage and GPU performance show that DeepFishNET+ can be integrated into modern embedded platforms. This possibility paves the way for practical applications in aquaculture farms.
However, certain limitations have been identified. On the one hand, image processing in extremely difficult conditions remains a challenge, requiring even more robust image enhancement techniques. On the other hand, integrating the model on low-resource devices requires additional effort. These limitations represent future opportunities to improve the transferability and generalization of our approach.

6. Conclusions and Future Work

In this study, we proposed DeepFishNET+, an innovative method for detecting and classifying fish species in complex underwater environments. Our approach combines four main modules: the Underwater Image Enhancement Network, Dual-Stream Feature Extractor, Cross-Attention Feature Fusion, and Multi-Task Head. This combination has resulted in superior performance compared to existing methods, with accuracies of up to 98.28% for classification and 92.74% for detection. These results demonstrate the performance and effectiveness of DeepFishNET+ even in challenging environments with factors such as low light, continuous movement, and water turbidity. On a practical level, this method offers aquaculturists the possibility of real-time monitoring, with the potential for integration into onboard equipment, reducing dependence on human intervention. However, some limitations remain, including sensitivity to extreme conditions and the need to optimize real-time processing for onboard devices.
Future work may focus on extending the model to a greater number of species and environments, integrating behavioral tracking and analysis models, and evaluating its robustness in real aquaculture farms to confirm its practical and, above all, academic impact.

Author Contributions

Conceptualization, M.H., M.O.-E.A., M.R. and R.B.; methodology, M.H., M.O.-E.A., M.R. and R.B.; software, M.H. and M.O.-E.A.; validation, M.H., M.O.-E.A., M.R. and R.B.; formal analysis, M.H., M.O.-E.A., M.R. and R.B.; investigation, M.H.; resources, M.H., M.O.-E.A. and M.R.; data curation, M.H. and M.O.-E.A.; writing—original draft preparation, M.H.; writing—review and editing, M.O.-E.A., M.R. and R.B.; visualization, M.H., M.O.-E.A., M.R. and R.B.; supervision, R.B.; project administration, M.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2502).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. FAO. The State of World Fisheries and Aquaculture 2021; FAO: Rome, Italy, 2021; p. 254. [Google Scholar]
  2. Financial Times. Aquaculture Overtakes Wild Fishing as Main Source of Fish. Available online: https://www.ft.com/content/140ed100-c288-4b20-a9c3-fac16164c7e5 (accessed on 8 September 2025).
  3. FAO. World Fisheries and Aquaculture 2020; FAO: Rome, Italy, 2020; pp. 1–244. [Google Scholar]
  4. Sun, K.; Cui, W.; Chen, C. Review of underwater sensing technologies and applications. Sensors 2021, 21, 7849. [Google Scholar] [CrossRef]
  5. Braithwaite, V.A.; Ebbesson, L.O.E. Pain and stress responses in farmed fish. Rev. Sci. Tech. 2014, 33, 245–253. [Google Scholar] [CrossRef]
  6. Mandal, A.; Ghosh, A.R. Role of artificial intelligence (AI) in fish growth and health status monitoring: A review on sustainable aquaculture. Aquac. Int. 2024, 32, 2791–2820. [Google Scholar] [CrossRef]
  7. Li, D.; Zhang, Y.; Wang, X.; Liu, H.; Chen, L.; Zhao, Q. Advanced techniques for the intelligent diagnosis of fish diseases: A review. Animals 2022, 12, 2938. [Google Scholar] [CrossRef]
  8. Hamzaoui, M.; Ould-Elhassen Aoueileyine, M.; Bouallegue, S.; Bouallegue, R. Enhanced detection of Argulus and epizootic ulcerative syndrome in fish aquaculture through an improved deep learning model. J. Aquat. Anim. Health 2025, 37, 97–109. [Google Scholar] [CrossRef] [PubMed]
  9. Alsakar, Y.M.; Sakr, N.A.; El-Sappagh, S.; Abuhmed, T.; Elmogy, M. Underwater image restoration and enhancement: A comprehensive review of recent trends, challenges, and applications. Vis. Comput. 2025, 41, 3735–3783. [Google Scholar] [CrossRef]
  10. Elmezain, M.; Saoud, L.S.; Sultan, A.; Heshmat, M.; Seneviratne, L.; Hussain, I. Advancing underwater vision: A survey of deep learning models for underwater object recognition and tracking. IEEE Access 2025, 13, 17830–17867. [Google Scholar] [CrossRef]
  11. Naveen, P. Advancements in underwater imaging through machine learning: Techniques, challenges, and applications. Multimed. Tools Appl. 2025, 84, 24839–24858. [Google Scholar] [CrossRef]
  12. Liu, Z.; Chen, H.; Wang, J.; Li, Y.; Zhao, Q.; Yang, T. UnitModule: A lightweight joint image enhancement module for underwater object detection. Pattern Recognit. 2024, 151, 110435. [Google Scholar] [CrossRef]
  13. Pachaiyappan, P.; Kumar, A.; Ramesh, S.; Natarajan, K. Enhancing underwater object detection and classification using advanced imaging techniques: A novel approach with diffusion models. Sustainability 2024, 16, 7488. [Google Scholar] [CrossRef]
  14. Guan, F.; Li, J.; Wang, H.; Chen, L.; Zhao, Y.; Zhang, X.; Liu, Y. AUIE–GAN: Adaptive underwater image enhancement based on generative adversarial networks. J. Mar. Sci. Eng. 2023, 11, 1476. [Google Scholar] [CrossRef]
  15. Cong, R.; Li, Y.; Wang, X.; Chen, Z.; Zhao, L.; Liu, J. Pugan: Physical model-guided underwater image enhancement using GAN with dual-discriminators. IEEE Trans. Image Process. 2023, 32, 4472–4485. [Google Scholar] [CrossRef] [PubMed]
  16. Hao, X.; Liu, L. DGC-UWnet: Underwater image enhancement based on computation-efficient convolution and channel shuffle. IET Image Process. 2023, 17, 2158–2167. [Google Scholar] [CrossRef]
  17. Chen, J.; Li, X.; Wang, Y.; Zhang, H. Collaborative compensative transformer network for salient object detection. Pattern Recognit. 2024, 154, 110600. [Google Scholar] [CrossRef]
  18. Meng, L.; Zhao, Q.; Sun, Y.; Liu, J.; Wang, T. RGB depth salient object detection via cross-modal attention and boundary feature guidance. IET Comput. Vis. 2024, 18, 273–288. [Google Scholar] [CrossRef]
  19. Saoud, L.S.; Seneviratne, L.; Hussain, I. ADOD: Adaptive domain-aware object detection with residual attention for underwater environments. In Proceedings of the 21st International Conference on Advanced Robotics (ICAR), Abu Dhabi, United Arab Emirates, 5–8 December 2023. [Google Scholar]
  20. Wen, J.; Li, Y.; Zhang, T.; Zhao, H.; Liu, X.; Chen, L.; Wang, S. EnYOLO: A real-time framework for domain-adaptive underwater object detection with image enhancement. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  21. Saad Saoud, L.; Seneviratne, L.; Hussain, I. MARS: Multi-Scale Adaptive Robotics Vision for Underwater Object Detection and Domain Generalization. arXiv 2023, arXiv:2312.15275. [Google Scholar] [CrossRef]
  22. Dai, L.; Wang, X.; Li, J.; Chen, H.; Zhao, Q.; Liu, Y. A gated cross-domain collaborative network for underwater object detection. Pattern Recognit. 2024, 149, 110222. [Google Scholar] [CrossRef]
  23. Folkman, L.; Pitt, K.A.; Stantic, B. A data-centric framework for combating domain shift in underwater object detection with image enhancement. Appl. Intell. 2025, 55, 272. [Google Scholar] [CrossRef]
  24. Agrawal, A.; Singh, P.; Kumar, S.; Ramesh, T.; Natarajan, K. Syn2Real Domain Generalization for Underwater Mine-like Object Detection Using Side-Scan Sonar. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5503105. [Google Scholar] [CrossRef]
  25. Chen, L.; Li, X.; Zhang, Y.; Wang, H.; Zhao, Q.; Liu, T. SWIPENET: Object detection in noisy underwater scenes. Pattern Recognit. 2022, 132, 108926. [Google Scholar] [CrossRef]
  26. Chen, G.; Li, H.; Wang, X.; Zhang, Y.; Zhao, Q.; Liu, T. HTDet: A hybrid transformer-based approach for underwater small object detection. Remote Sens. 2023, 15, 1076. [Google Scholar] [CrossRef]
  27. Wang, Z.; Ruan, Z.; Chen, C. DyFish-DETR: Underwater fish image recognition based on detection transformer. J. Mar. Sci. Eng. 2024, 12, 864. [Google Scholar] [CrossRef]
  28. Pavithra, S.; Kumar, A.; Ramesh, T.; Natarajan, K. An efficient approach to detect and segment underwater images using Swin Transformer. Results Eng. 2024, 23, 102460. [Google Scholar] [CrossRef]
  29. Lei, J.; Wang, X.; Li, H.; Zhao, Q.; Chen, L.; Liu, T. CNN–Transformer Hybrid Architecture for Underwater Sonar Image Segmentation. Remote Sens. 2025, 17, 707. [Google Scholar] [CrossRef]
  30. Prasetyo, E.; Suciati, N.; Fatichah, C. Fish-gres dataset for fish species classification. Mendeley Data 2020, 10, 12. [Google Scholar] [CrossRef]
  31. Fisher, R.B.; Li, X.; Zhang, Y.; Chen, H. (Eds.) Fish4Knowledge: Collecting and Analyzing Massive Coral Reef Fish Video Data; Springer: Berlin/Heidelberg, Germany, 2016; Volume 104. [Google Scholar]
  32. Ulucan, O.; Karakaya, D.; Turkan, M. A large-scale dataset for fish segmentation and classification. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020. [Google Scholar]
  33. Shah, S.Z.H.; Khan, A.; Riaz, F.; Ahmed, I. Fish-Pak: Fish species dataset from Pakistan for visual features based classification. Data Brief 2019, 27, 104565. [Google Scholar] [CrossRef]
  34. Hamzaoui, M.; Ould-Elhassen Aoueileyine, M.; Romdhani, L.; Bouallegue, R. An improved deep learning model for underwater species recognition in aquaculture. Fishes 2023, 8, 514. [Google Scholar] [CrossRef]
  35. Veiga, R.J.M.; Rodrigues, J.M.F. Fine-Grained Fish Classification from small to large datasets with Vision Transformers. IEEE Access 2024, 12, 113642–113660. [Google Scholar] [CrossRef]
  36. Zhang, Q.; Chen, S. Research on improved lightweight fish detection algorithm based on YOLOv8n. J. Mar. Sci. Eng. 2024, 12, 1726. [Google Scholar] [CrossRef]
Figure 1. The architecture of the proposed method in the training phase.
Figure 2. Cross-Attention Feature Fusion architecture.
Figure 3. Multi-Task Head architecture.
Figure 4. Capture of batch 0 and batch 12 contents.
Figure 5. Image of Big Head Carp before and after improvement.
Figure 6. Heatmap visualization for the Dual-Stream Feature Extractor module.
Figure 7. Precision–confidence curves of YOLOv8 and DeepFishNET+ models.
Figure 8. Learning curve evolution of DeepFishNET+ method.
Figure 9. Classification results obtained by different models.
Figure 10. Detection results obtained by different models.
Table 1. The label distribution in the dataset.

Fish Species | Scientific Name | Count of Samples | Train | Validation | Test
Bangus | Chanos chanos | 223 | 156 | 34 | 33
Big Head Carp | Aristichthys nobilis | 213 | 150 | 32 | 31
Black Spotted Barb | Puntius binotatus | 190 | 133 | 29 | 28
Climbing Perch | Anabas testudineus | 190 | 133 | 29 | 28
Fourfinger Threadfin | Eleutheronema tetradactylum | 190 | 133 | 29 | 28
Glass Perchlet | Ambassis vachellii | 190 | 133 | 29 | 28
Gourami | Trichopodus trichopterus | 190 | 133 | 29 | 28
Jaguar Gapote | Parachromis managuensis | 190 | 133 | 29 | 28
Scat Fish | Scatophagus argus | 190 | 133 | 29 | 28
Tilapia | Oreochromis niloticus | 190 | 133 | 29 | 28
Table 2. DeepFishNET+ hyperparameters.

Parameter | Value Selected
Optimizer | AdamW (lr = 3 × 10⁻⁴, weight decay = 0.05)
LR Schedule | Warm-up (lr: 0 → 0.001) + cosine decay (lr: 0.001 → 1 × 10⁻⁶)
Batch Size | 256
Regularization | Dropout = 0.1, StochDepth = 0.2
Table 3. Image quality assessment before and after enhancement.

Metric | Original | After UIE-Net Enhancement
PSNR | 18.2 dB | 25.6 dB
SSIM | 0.58 | 0.79
UIQM | 6.04 | 7.18
UCIQE | 0.44 | 0.63
Table 4. Performance metrics for classification using ViT-B/16, ResNet50, Swin Transformer, and DeepFishNET+.

Model | Accuracy | Precision | Recall | F1-Score
ResNet50 | 91.16% | 93.97% | 92.09% | 91.01%
ViT-B/16 | 93.65% | 94.25% | 93.72% | 93.32%
Swin Transformer | 96.86% | 96.92% | 96.68% | 96.51%
DeepFishNET+ | 98.43% | 98.28% | 98.21% | 98.21%
Table 5. Performance metrics for detection using YOLOv7, YOLOv8, and DeepFishNET+.

Model | Precision | mAP50 | Box Loss | Time (ms)
YOLOv7 | 83.12% | 86.91% | 0.605 | 11.655
YOLOv8 | 89.82% | 90.82% | 0.575 | 29.443
DeepFishNET+ | 92.74% | 97.10% | 0.425 | 42.106
Table 6. Performance of DeepFishNET+ on different fish datasets.

Datasets | Fish Species | Precision in Classification (%) | Precision in Detection (%)
Fish-gres dataset [30] | Chanos chanos, Johnius trachycephalus, Nibea albiflora, Rastrelliger faughni, Upeneus moluccensis, Eleutheronema tetradactylum, Oreochromis mossambicus, and Oreochromis niloticus | 99.72 | 93.01
Fish4Knowledge dataset [31] | Acanthuridae, Pomacentridae, Labridae, Chaetodontidae, Balistidae, Serranidae | 99.12 | 92.93
A large-scale dataset [32] | Gilt head bream, Red sea bream, Sea bass, Red mullet, Horse mackerel, Black sea sprat, Striped red mullet, Trout, Shrimp | 96.86 | 90.72
Fish-Pak dataset [33] | Grass carp, Common carp, Mori, Rohu, Silver carp, Thala | 98.26 | 91.48
Table 7. Comparison of existing work with our approach, DeepFishNET+.

Work | Approach | Dataset | Results
[34] | Develops an improved YOLOv5 model for locating and classifying fish types. Transfer learning is applied: the final model is based on the weights of a pre-trained model called FishMask, itself trained on a dataset containing images of fish masks. | A large-scale dataset | Accuracy: 96%
[35] | Integrates the Fine-Grained Visual Classification Plugin Module (FGVC-PIM) with the Swin Transformer architecture. While the FGVC-PIM concentrates on identifying the most discriminative regions within an image, the Swin Transformer ensures robust feature extraction. The model was evaluated on multiple datasets under diverse environmental conditions. | Fish-gres dataset, Fish4Knowledge dataset, Fish-Pak dataset, A large-scale dataset, among others | Accuracy: above 83%
[36] | The proposed CUIB-YOLO algorithm introduces a C2f-UIB module to reduce model parameters and integrates the EMA mechanism into the neck network to optimize feature fusion. | Roboflow Universe dataset library | mAP@0.5: 95.7%
Our work | DeepFishNET+ enhances underwater images with UIE-Net, extracts features with a dual ResNet50/Swin Transformer stream, fuses them through cross-attention, and performs detection and species classification with a YOLOv8-based multi-task head. | YOLO-ViT dataset | Classification precision: 98.28%; detection precision: 92.74%