4.2. Comparison of Experimental Parameters
Over the course of model training, as summarized in Table 4, the study systematically monitored several critical parameters to ensure a comprehensive evaluation of performance [29].
These metrics collectively offer insights into the model’s learning behavior, convergence stability, and detection accuracy, thereby serving as essential indicators for performance assessment throughout the training procedure. Specifically, Loss_rpn_cls and Loss_rpn_bbox are used to evaluate the classification accuracy and bounding box regression performance of the Region Proposal Network (RPN). A reduction in these losses implies that the RPN is generating more precise object proposals and achieving better localization performance, which directly influences the quality of the subsequent detection stage.
Furthermore, Loss_cls and Loss_bbox focus on the final detection stage, where they measure classification accuracy and localization precision. Lower values of these losses suggest that the model is not only correctly identifying object categories but also accurately predicting their spatial positions, thereby confirming the overall robustness and reliability of the detection framework.
Acc represents the overall classification accuracy across all test samples. To adapt to few-shot learning tasks, Loss_meta_cls and Meta_acc are introduced to measure the meta-classification loss and meta-level accuracy, respectively. Lower Loss_meta_cls values and higher Meta_acc scores indicate better recognition of novel classes with limited labeled data. Loss_vae evaluates the data-modeling capability of the Variational Autoencoder; a lower value indicates a more precise model of the latent feature distribution.
Finally, the overall loss aggregates all individual loss terms, providing a unified indicator of the model’s training performance. A lower total loss implies more effective learning across object detection, meta-learning, and latent feature modeling. These parameters collectively offer a detailed assessment of the model’s capability, particularly in few-shot learning scenarios.
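As a minimal sketch of the aggregation described above (not the authors' exact implementation; the function name and any per-term weights are assumptions for illustration), the overall loss can be formed by summing the individual loss terms:

```python
# Illustrative sketch: combine the named loss terms into one scalar,
# optionally weighting each term. Weights default to 1.0.

def total_loss(losses, weights=None):
    """Aggregate named loss terms into a single scalar.

    losses  -- dict mapping loss name to its current value
    weights -- optional dict of per-term weights (default: 1.0 each)
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

# Hypothetical values for the loss terms named in the text:
losses = {
    "loss_rpn_cls": 0.12, "loss_rpn_bbox": 0.08,
    "loss_cls": 0.21, "loss_bbox": 0.15,
    "loss_meta_cls": 0.05, "loss_vae": 0.03,
}
print(total_loss(losses))  # unweighted sum of all terms
```

A decrease in this scalar over training corresponds to the unified convergence indicator discussed in the text.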
4.7. Performance and Convergence Analysis
To comprehensively evaluate the model’s overall performance, we track both the total loss and Overall Accuracy during training. The total loss curve (Loss) reflects the combined loss from all model components, offering a holistic view of the model’s convergence. A lower total loss indicates improved performance across all tasks.
As shown in Figure 10, the Overall Accuracy and loss curves demonstrate the superior performance of the proposed model compared to ResNet. In the Accuracy curve, the proposed model ultimately achieves higher accuracy: although ResNet performs better in the early stages, the proposed model quickly surpasses it and stabilizes at a higher level. This reflects the model’s more efficient learning and faster convergence.
Similarly, in the loss curve, the proposed model shows a steady decline in loss, ultimately reaching a lower and more stable value than ResNet. While ResNet fluctuates and struggles to reduce its loss in later iterations, the proposed model converges more smoothly, demonstrating better optimization in both classification and localization. These results emphasize the model’s robustness and superior generalization ability. This experiment adopts the default evaluation settings of the MMFewShot framework: the IoU threshold for determining a correct detection box is 0.5 (mAP@0.5), the AP of each category is computed as the area under its precision–recall (P–R) curve, and the final mAP is obtained by averaging across all categories. The study uses mAP as a comprehensive evaluation metric, as shown in Table 5. To ensure reproducibility, key experiments (especially the 10-shot setting) were repeated with five different random seeds; the results are reported as mean ± standard deviation, and t-tests indicate that certain improvements are statistically significant at the p < 0.05 level. In addition, to substantiate the experimental results, precision, recall, and F1-score were also recorded to assess the model’s optimization effect from multiple perspectives, enabling a more comprehensive evaluation and easier comparison with the cited and other related work.
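The supplementary metrics and the mean ± standard deviation reporting can be sketched as follows (the counts and per-seed mAP values are hypothetical placeholders, not the study's data):

```python
# Sketch: precision, recall, and F1-score from detection counts, plus
# mean +/- std aggregation across random seeds, as reported in the text.
from statistics import mean, stdev

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical per-seed mAP values for a 10-shot run over five seeds:
seed_maps = [0.262, 0.268, 0.265, 0.259, 0.271]
print(f"mAP = {mean(seed_maps):.3f} \u00b1 {stdev(seed_maps):.3f}")

p, r, f1 = prf1(tp=80, fp=20, fn=20)  # all three equal 0.8 here
```

A paired t-test over such per-seed results (e.g., via `scipy.stats.ttest_rel`) is what supports the p < 0.05 significance claim above.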
Because mAP is the core indicator in object detection and is closely related to the other metrics, averaging the per-category AP values to reflect the model’s overall detection performance across categories, the study focuses on comparing mAP values. The mAP comparison for the two fish groups, red fish and black fish, shows that the proposed model consistently outperforms the VFA method. For red fish, the proposed model achieves a mAP of 0.775 in base training and 0.265 in 10-shot fine-tuning, slightly higher than VFA’s 0.763 and 0.258. For black fish, the improvements are more evident, with the proposed model reaching 0.833 and 0.286, compared to VFA’s 0.804 and 0.271. These results demonstrate the proposed model’s better generalization and adaptability under few-shot settings.
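The per-category AP and its average can be sketched as follows. This is a minimal illustration assuming the standard VOC-style all-point interpolation (which MMFewShot's default evaluator follows); the sample P–R points are hypothetical:

```python
# Sketch: AP as the area under the precision-recall curve, and mAP as the
# average of per-category AP values, as described in the evaluation setup.

def average_precision(recalls, precisions):
    """Area under the P-R curve with all-point interpolation: precision at
    each recall level is replaced by the maximum precision to its right."""
    r = [0.0] + list(recalls) + [1.0]   # add sentinel endpoints
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):  # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    # Sum precision * recall-step over the curve.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_ap(per_class_ap):
    """mAP = mean of per-category AP values."""
    return sum(per_class_ap.values()) / len(per_class_ap)

ap = average_precision([0.1, 0.4, 0.8], [1.0, 0.9, 0.6])  # hypothetical curve
print(f"{mean_ap({'red_fish': 0.775, 'black_fish': 0.833}):.3f}")  # 0.804
```

The base-training mAP values quoted above are exactly such per-category averages.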
To evaluate the performance of different models in fish detection tasks, we conducted a series of experiments using the following models: TFA [30], Meta-RCNN [31], VFA, and the module improved in this study. Additionally, we explored the models’ performance under different few-shot learning conditions (1-shot, 5-shot, 10-shot). These experiments assess the models’ ability to adapt to new classes with limited labeled data, simulating the real-world challenge of having only a few samples.
This study compares the backbone networks in terms of their training accuracy on the same dataset, as shown in Table 6. Bold values indicate the best performance among the compared methods.
The proposed model consistently outperforms the other approaches across all fine-tuning conditions for both red-fish and black-fish detection tasks. For red fish, under the 10-shot fine-tuning setting, the proposed model achieves a mAP of 0.265, exceeding VFA (0.258), Meta-RCNN (0.224), and TFA (0.125). Similarly, for black fish in the same setting, the proposed model attains a mAP of 0.286, surpassing VFA (0.271), Meta-RCNN (0.244), and TFA (0.129). The performance advantage is particularly pronounced in the low-shot scenarios. In the 1-shot condition, the proposed model records mAPs of 0.152 for red fish and 0.169 for black fish, both of which represent notable improvements over competing methods, indicating superior rapid learning and adaptability to novel categories. In the 5-shot setting, the proposed model continues to outperform, reaching mAP values of 0.247 for red fish and 0.253 for black fish, which further confirms its effectiveness in few-shot detection scenarios.
Based on the reported mAP values, detection of novel-class fish targets under low-shot settings yields markedly low mAP, exemplified by the value of 0.152 for red fish under the 1-shot condition. This study identifies three primary contributing factors. First, inadequate feature discriminability arises from the limited number of training samples, hindering the model’s ability to learn subtle yet distinguishing features and leading to a high incidence of both false positives and false negatives. This limitation is corroborated by a substantial discrepancy between predicted bounding boxes and ground-truth annotations: the area of the predicted boxes is approximately one-tenth that of the actual annotations. Second, a pronounced domain shift is evident. The annotated dataset and the public dataset used for base-class learning were acquired under different lighting conditions (in-air vs. underwater), potentially impairing the model’s adaptability to the novel domain. Third, overfitting to the support set is observed: the model performs well on the limited support samples but fails to generalize to the query set, as indicated by a lower recall rate. From a statistical perspective, the low mAP indicates that the detection proposals exhibit low precision across all recall levels, implying that a considerable number of predictions are either incorrect or assigned low confidence. Despite the modest absolute performance, reporting this result remains critically valuable. It establishes a rigorous performance baseline for a highly challenging task, quantitatively characterizing the difficulty imposed by the combined constraints of low-shot learning and domain difference. Furthermore, it clearly delineates the limitations of current methodologies, thereby providing a clear benchmark for comparison and directing meaningful pathways for future research and improvement.
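The IoU criterion behind these failures can be sketched as follows (the box coordinates are hypothetical, not the study's data). It also shows why predicted boxes with roughly one-tenth of the ground-truth area can never pass the mAP@0.5 threshold:

```python
# Sketch: intersection-over-union (IoU) between two axis-aligned boxes,
# as used by the mAP@0.5 criterion. Boxes are (x1, y1, x2, y2).

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

gt = (0, 0, 100, 100)        # hypothetical ground-truth box, area 10000
pred = (0, 0, 31.6, 31.6)    # predicted box with roughly 1/10 of the GT area
print(iou(gt, pred) >= 0.5)  # False: even fully contained, IoU tops out near 0.1
```

This makes the reported area discrepancy a direct statistical explanation for the low 1-shot mAP: such predictions are scored as false positives at IoU 0.5 regardless of how well they are centered.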
These results demonstrate that the proposed model can effectively detect new categories with very limited samples while maintaining high detection accuracy. Compared to traditional models, the proposed approach adapts better to new fish species, reduces misclassifications, and exhibits strong generalization and learning efficiency, which also reflects the robustness of the model during optimization, especially when new data samples are scarce, as it effectively prevents overfitting while continuing to improve.
Table 7 shows the results when treating the shot count as a hyperparameter, tested with shot ∈ {1, 5, 10}. For each setting, we recorded the detection accuracy (mAP) and the average time per training epoch. The results clearly show that increasing the number of shots improves detection accuracy: mAP rises from 0.160 under the 1-shot setting to 0.275 under the 10-shot setting. Although the 10-shot model requires more time per epoch (0.030) than the 1-shot (0.012) and 5-shot (0.021) models, the accuracy improvement is significant, indicating that the additional training time is worthwhile and delivers the best balance between performance and efficiency in our experiments. Based on these results, we selected shot = 10 for the remaining experiments.
In summary, increasing K improves validation mAP but incurs a higher time per epoch (and label cost): mAP rises from 0.160 (K = 1) to 0.275 (K = 10), while time per epoch increases from 0.012 to 0.030. The marginal mAP gain per additional labeled image exhibits diminishing returns. Thus, K = 10 achieves the highest accuracy (0.275), but K = 5 attains ~91% of the mAP of K = 10 (0.250/0.275) at ~70% of the training time per epoch (0.021/0.030) and 50% of the labels per class, representing a strong Pareto point when resources are constrained. In the main experiments, we adopt K = 10 to report the best attainable accuracy under our setting. For deployments with tight labeling or compute budgets, K = 5 is recommended as a balanced choice.
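The trade-off arithmetic above can be reproduced directly from the Table 7 numbers (a small sketch; the dictionary layout is an illustrative assumption):

```python
# Sketch: relative accuracy and cost of each shot setting K, normalized
# against the best setting (K = 10), using the values reported in Table 7.
K_SETTINGS = {1: (0.160, 0.012), 5: (0.250, 0.021), 10: (0.275, 0.030)}  # K -> (mAP, time/epoch)

best_map, best_time = K_SETTINGS[10]
for k, (m, t) in K_SETTINGS.items():
    print(f"K={k:>2}: {m / best_map:.0%} of best mAP at {t / best_time:.0%} of the time/epoch")
```

Running this recovers the figures cited in the summary: K = 5 reaches 91% of the best mAP at 70% of the per-epoch cost, which is the basis for recommending it under constrained budgets.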
Following the ablation study, the study additionally provides a visual comparison between predicted bounding boxes and ground-truth annotations on single test images, as shown in Table 8, offering a clear and intuitive demonstration of the model’s capability in both object counting and localization precision.
To further investigate the model’s performance in complex underwater environments, three representative cases of failure or suboptimal detection were selected from the validation set (see Figure 11a–c). The main issues include partial occlusion, instance merging that leads to undercounting, and missed detections caused by turbid water containing settled excreta.
These examples indicate that although the proposed model generally performs robustly, certain limitations remain under extreme lighting, heavy occlusion, and highly turbid water with settled excreta. Future work may focus on enhancing data augmentation strategies and feature extraction mechanisms to improve robustness and generalization in such challenging scenarios.
Based on the research content outlined above, this study proposes a targeted technical framework for few-shot object detection and deploys it within an operational RAS, as illustrated in Figure 12. Empirical validation was conducted across successive breeding cycles, demonstrating the method’s efficacy in accurately distinguishing between different categories of fish species (e.g., red fish and black fish). Beyond recognizing these two types of fish schools, the model also shows promising detection performance for multi-class object detection in subsequent research. The few-shot learning approach aims to learn general features of fish schools and to improve robustness against environmental interference factors. The proposed approach effectively addresses the limitations of conventional single-class dataset recognition, which typically demands large-scale annotated data and exhibits poor robustness.