Research on Fish Recognition in Complex Backgrounds Using ViT-Enhanced YOLOv11
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset Construction
2.2. Data Preprocessing
2.3. YOLOv11 Model
2.4. ViT Technology
2.5. ViT and YOLOv11 Fusion Model
- (1) YOLOv11-ViT1 Model: This model embeds a ViT module at the beginning of the YOLOv11 Head. The network architecture diagram is shown in Figure 4. The specific configuration is as follows: the feature map is divided into 20 × 20 patches, projected into a 256-dimensional vector space, and then fed into an encoder composed of 10 Transformer blocks. Each block uses 8 attention heads, an MLP hidden dimension of 2048, and a dropout rate of 0.1. This design enables the model to leverage ViT’s global context modeling capability to reintegrate and enhance the multi-scale features from the Backbone and Neck before final detection, thereby improving target localization and classification accuracy.
- (2) YOLOv11-ViT2 Model: This model embeds an enhanced ViT module at the end of the YOLOv11 Backbone. The network architecture diagram is shown in Figure 5. The ViT configuration is modified as follows: the patch embedding dimension is increased to 512, the number of attention heads is set to 4, the number of encoder layers is set to 16, and a C2fAttn attention enhancement module is introduced into the Head. C2fAttn incorporates multi-head self-attention into the standard C2f module: the input feature map is split and fed into two branches, one of which retains convolutional feature extraction while the other captures global dependencies through multi-head self-attention. The outputs of the two branches are then fused, and the features are compressed by a pointwise convolution. This design lets the model balance local texture against long-range context while remaining lightweight, improving recognition of key fish parts. Placing ViT at the end of the Backbone makes it act as a feature extractor that injects global context without disrupting the integrity of the feature pyramid, strengthening the model’s feature representation and fusion capabilities.
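
The two encoder configurations described in this list can be written down as a small bookkeeping sketch. This is only an illustration, not the authors' implementation: `ViTConfig` and `attention_entries` are hypothetical names, the reading of "20 × 20 patches" as a 20 × 20 token grid (400 tokens) is an assumption, and the MLP width and dropout of YOLOv11-ViT2 are left unset because the text does not state them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ViTConfig:
    """Encoder hyperparameters as stated in the text (None = not stated)."""
    name: str
    embed_dim: int
    num_heads: int
    depth: int
    mlp_dim: Optional[int] = None
    dropout: Optional[float] = None

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the embedding evenly across heads.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads


def attention_entries(num_tokens: int, num_heads: int) -> int:
    """Number of attention scores computed by one encoder layer."""
    return num_heads * num_tokens * num_tokens


vit1 = ViTConfig("YOLOv11-ViT1", embed_dim=256, num_heads=8, depth=10,
                 mlp_dim=2048, dropout=0.1)
vit2 = ViTConfig("YOLOv11-ViT2", embed_dim=512, num_heads=4, depth=16)

# Assumption: a 20 x 20 patch grid yields 400 tokens per image.
tokens = 20 * 20

for cfg in (vit1, vit2):
    print(cfg.name, "per-head dimension:", cfg.head_dim)
print("ViT1 attention scores per layer:", attention_entries(tokens, vit1.num_heads))
```

Under these assumptions each YOLOv11-ViT1 head works on 32 channels and each YOLOv11-ViT2 head on 128, and every token attends to all 400 tokens, which is the global context the text refers to.
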
2.6. Experimental Setting
2.7. Experimental Environment
2.8. Evaluation Index
2.9. Ablation Experiment Setup
3. Results
3.1. Comparison of Model Detection Accuracy
3.2. Comparison of Evaluation Metrics
3.3. Ablation Experiment Results
3.4. Classification Results for the Single Fish Species Scenario
3.5. Classification Results for the Multiple Classes Separated Scenario
3.6. Classification Results for the Slight Overlap of Multiple Fish Species Scenario
3.7. Classification Results for the Severe Overlap of Multiple Fish Species Scenario
4. Discussion
4.1. The Model’s Recognition Performance in Complex Background
4.2. Model Improvement Effect and Performance Improvement Analysis
4.3. Limitations Regarding ViT Hyperparameter Selection
4.4. Limitations in Practical Application and Directions for Improvement
- (1) Integrating lightweight Transformer architectures (such as Swin Transformer or MobileViT) to reduce model parameters and computational complexity while maintaining accuracy, thereby improving real-time inference performance;
- (2) Introducing temporal modeling mechanisms (e.g., 3D convolutions or temporal Transformers) to incorporate inter-frame correlation features during training, enabling dynamic recognition of fish movement behaviors and posture changes;
- (3) Fusing multi-modal information (such as infrared images, depth images, and environmental illumination data) to build a multi-source perception framework for fish recognition, enhancing model robustness in low-light, strong-glare, and severe occlusion scenarios.
5. Conclusions
- (1) The improved YOLOv11-ViT model achieves superior comprehensive performance on the FishRecognition-2025 dataset. The recall rate is increased to 81.99%, mAP50-95 is improved to 0.6460, and the F1 score reaches 0.8731. The model demonstrates more stable recognition results in complex backgrounds.
- (2) The ViT module enhances the model’s ability to distinguish overlapping and occluded targets through its global self-attention mechanism. This compensates for the limitations of YOLO series models in terms of local feature receptive fields, significantly reducing false detection and missed detection rates.
- (3) The improved model outperforms traditional convolutional models in simulated complex deck environments and stacked fish bodies, suggesting a promising technical direction for intelligent fisheries monitoring and automated identification of fishing data.
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Scenario (mAP50) | YOLOv11 | YOLOv11-ViT1 | YOLOv11-ViT2 | CNN | YOLOv8 | VGG | YOLOv11-CBAM |
|---|---|---|---|---|---|---|---|
| single fish species | 0.9950 | 0.9685 | 0.9950 | 0.9950 | 0.9950 | 0.9546 | 0.9950 |
| multiple classes separated | 0.9731 | 0.9741 | 0.9774 | 0.9807 | 0.9671 | 0.9428 | 0.9750 |
| slight overlap of multiple fish species | 0.8099 | 0.7850 | 0.8551 | 0.7976 | 0.8529 | 0.6978 | 0.7810 |
| severe overlap of multiple fish species | 0.6982 | 0.6662 | 0.7867 | 0.6749 | 0.7395 | 0.5626 | 0.7105 |


| Scenario (mAP50-95) | YOLOv11 | YOLOv11-ViT1 | YOLOv11-ViT2 | CNN | YOLOv8 | VGG | YOLOv11-CBAM |
|---|---|---|---|---|---|---|---|
| single fish species | 0.8908 | 0.7825 | 0.82668 | 0.8349 | 0.8782 | 0.7321 | 0.8477 |
| multiple classes separated | 0.8037 | 0.8045 | 0.76125 | 0.8137 | 0.8032 | 0.6883 | 0.7746 |
| slight overlap of multiple fish species | 0.5312 | 0.5218 | 0.5568 | 0.5656 | 0.5979 | 0.3882 | 0.5024 |
| severe overlap of multiple fish species | 0.4287 | 0.3962 | 0.4502 | 0.4120 | 0.4962 | 0.2899 | 0.4098 |


| Model | Precision | Recall | F1 | Parameters (M) | FLOPs (G) | GPU Memory (MB) | Latency (ms) | FPS |
|---|---|---|---|---|---|---|---|---|
| VGG | 0.7713 | 0.6807 | 0.7207 | 0.89 | 2.21 | 26.98 | 2.41 | 414.5 |
| YOLOv8 | 0.8152 | 0.7766 | 0.8365 | 3.01 | 4.10 | 47.25 | 6.57 | 152.1 |
| YOLOv11-CBAM | 0.8540 | 0.7725 | 0.8092 | 2.59 | 3.22 | 81.77 | 9.06 | 110.4 |
| YOLOv11 | 0.8671 | 0.8044 | 0.8346 | 2.78 | 3.33 | 72.70 | 12.56 | 79.6 |
| CNN | 0.8587 | 0.7815 | 0.8183 | 3.53 | 3.64 | 134.81 | 9.60 | 104.2 |
| YOLOv11-ViT1 | 0.8232 | 0.7929 | 0.8078 | 3.15 | 3.75 | 102.5 | 13.2 | 75.8 |
| YOLOv11-ViT2 | 0.8902 | 0.8686 | 0.8792 | 2.59 | 3.22 | 119.27 | 8.69 | 115.0 |

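
The F1 and FPS columns of this table can be related to the Precision, Recall, and Latency columns with two standard formulas. The sketch below is illustrative only: it assumes that F1 is the harmonic mean of the reported precision and recall and that FPS is the reciprocal of the per-image latency. The function names are hypothetical, and only the three YOLOv11-based rows are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def fps_from_latency(latency_ms: float) -> float:
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


# (precision, recall, latency in ms, reported F1, reported FPS) from the table.
rows = {
    "YOLOv11":      (0.8671, 0.8044, 12.56, 0.8346, 79.6),
    "YOLOv11-ViT1": (0.8232, 0.7929, 13.2, 0.8078, 75.8),
    "YOLOv11-ViT2": (0.8902, 0.8686, 8.69, 0.8792, 115.0),
}

for name, (p, r, latency, f1_reported, fps_reported) in rows.items():
    f1 = f1_score(p, r)
    fps = fps_from_latency(latency)
    print(f"{name}: F1 {f1:.4f} (reported {f1_reported}), "
          f"FPS {fps:.1f} (reported {fps_reported})")
```

For these three rows the recomputed values agree with the reported ones to within 0.001 for F1 and 0.1 for FPS; the small residuals come from rounding of the tabulated inputs.
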

| Model | ViT (Backbone) | C2fAttn (Head) | Single Fish Species mAP50 | Multiple Classes Separated mAP50 | Slight Overlap of Multiple Fish Species mAP50 | Severe Overlap of Multiple Fish Species mAP50 | Severe Overlap of Multiple Fish Species mAP50-95 | F1 (Global) |
|---|---|---|---|---|---|---|---|---|
| A (baseline) | x | x | 0.9950 | 0.9731 | 0.8099 | 0.6982 | 0.4287 | 0.8346 |
| B (ViT only) | √ | x | 0.9950 | 0.9762 | 0.8433 | 0.7713 | 0.4412 | 0.8512 |
| C (C2fAttn only) | x | √ | 0.9950 | 0.9745 | 0.8210 | 0.7305 | 0.4358 | 0.8428 |
| D (full) | √ | √ | 0.9950 | 0.9774 | 0.8551 | 0.7867 | 0.4502 | 0.8792 |

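
One way to read the ablation table is to express each variant as a gain over the baseline. The sketch below does this for the severe-overlap mAP50 column only. The values are copied from the table, while the dictionary layout and variable names are hypothetical choices made for illustration.

```python
# Severe-overlap mAP50 for each ablation variant, copied from the table.
severe_map50 = {
    "A (baseline)": 0.6982,
    "B (ViT only)": 0.7713,
    "C (C2fAttn only)": 0.7305,
    "D (full)": 0.7867,
}

baseline = severe_map50["A (baseline)"]

# Absolute gain of each variant over the baseline, rounded to 4 decimals.
gains = {
    name: round(value - baseline, 4)
    for name, value in severe_map50.items()
    if name != "A (baseline)"
}

# Sum of the two single-module gains, for comparison with the full model.
sum_of_single_gains = round(gains["B (ViT only)"] + gains["C (C2fAttn only)"], 4)

for name, gain in gains.items():
    print(f"{name}: +{gain:.4f}")
print(f"B + C: +{sum_of_single_gains:.4f}")
```

On these numbers the ViT module alone contributes the larger share of the improvement (+0.0731 against +0.0323 for C2fAttn alone), and the full model's gain (+0.0885) is smaller than the sum of the two single-module gains (+0.1054), so the two modules appear to be partly redundant in this scenario.
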
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
He, X.; Yang, S.; Wang, W.; Zhu, K.; Zhang, S.; Dai, Y.; Jiang, K.; Wang, F. Research on Fish Recognition in Complex Backgrounds Using ViT-Enhanced YOLOv11. Fishes 2026, 11, 385. https://doi.org/10.3390/fishes11070385

